Nota Bene: This file accompanies the empirical study Housing Prices in Spain; note, however, that the table of contents of this file and of the paper do not follow the same order.

# Function used to format inline text.
applyFormat <- function(text) {
  paste('<span style="font-family: Courier New, monospace;">', text, '</span>', sep = "")
}

0. Data Collection

0.1 Foreword

Wikipedia has a list of 1313 Spanish cities by population, which will be used as a starting point to acquire information about real estate prices in the country.

The main idea of this project is to understand the key factors that drive housing prices. The data about prices and attributes come from fotocasa. The data collection process is mainly run in Python with a web scraper.

In a first query of fotocasa, about 23,000 pages were returned, so extracting all that information is not feasible. Instead, a subset of cities based on their population will be retrieved from the aforementioned list. Of the 1313 cities, 25% will be used to form the sample, which accounts for \(1313 \times 0.25 \approx 329\) cities; these need to be randomly selected if one would like to make general statements about the whole population without biasing the results.

As much of Spain's population is concentrated in a few large cities, it does not make sense to give all cities equal probabilities of being selected. If one believes that housing prices are the result of supply and demand, and that both are a function of the population, one can use the relative share of the population living in unit i (city i) as the probability of selecting that city.

# Import the csv file with information about the cities and create a new column with relative population weights to be used in the sampling function
cities <- as.data.frame(read_csv('./01_RawDatasets/01_SpainCities.csv')) %>% 
  mutate(RelPopulation = Population / sum(Population))

# Creates now a sample without replacement of the dataframe
# We set the random seed to assure reproducibility. 23111992 (my birthday)
set.seed(23111992)
sample <- sample_n(cities, size = ceiling(nrow(cities) * 0.25),
                   replace = FALSE, weight = cities$RelPopulation)
# and save the data to use it in python
#write.csv(sample, '02_SampledCities.csv')

# print some values of the new sampled data
datatable(sample, options = list(pageLength = 10))
#view(sample)

Next, Python is used to retrieve data from the site mentioned at the beginning of this section.

0.2. The Webscraping Process

In a first run, the Python module selenium was used to query the cities mentioned in Section 0.1 on fotocasa.es. For each query, the website returns a list, spread over several pages, of houses and apartments for the selected city. For each ad in the list, the program scraped its url (see Fig. 0.1), which leads the user to another page where the attributes of the housing are displayed in more detail, and saved each url to a local database called attributes.db, into a table called REF. The process lasted 7 days, from 08.05.2023 to 15.05.2023, and a total of around 59,000 urls were stored.

Fig. 0.1 - Fotocasa.es: (1) main page; (2) returned lists of housings; (3) link to the ads.

In a second run, using the requests and BeautifulSoup Python modules, the attributes of the housings were obtained. As the website recognized the webscraper as a robot, rotating proxies had to be used, which slowed down the process enormously; as a consequence, it was not possible to query all 59,000 urls. To ensure randomization in the query process, for each query the list of urls was randomly shuffled and a url selected. The process lasted for another 13 days.
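The shuffle-then-pick logic of this second run can be sketched in R (the actual scraper ran in Python); `urls` below is a hypothetical stand-in for the stored url list.

```r
# R sketch of the randomized query order used by the Python scraper;
# 'urls' is a hypothetical stand-in for the 59,000 stored urls.
urls <- c("https://www.fotocasa.es/ad/1",
          "https://www.fotocasa.es/ad/2",
          "https://www.fotocasa.es/ad/3")

set.seed(1)
# sample() without a size argument returns a random permutation,
# so every url is still visited exactly once, just in random order
query_order <- sample(urls)

print(query_order)
```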

Due to the number of days between obtaining the 59,000 urls and querying the attributes, some housings had been sold and their urls were no longer available. Nevertheless, the program was able to query around 13,000 housings from the website. The attributes that were scraped are displayed in Fig. 0.2. These attributes were first saved to a Python dictionary and then added to the database in JSON format.

Fig. 0.2 - Webscraped Attributes: the figure shows an illustration of the attributes that were scraped from the website. Note that some attributes might differ from one unit to another.

1. Data Preparation

1.1. Matching the database and the Sample

Now, the attributes for the real estate units saved in the database need to be assigned a city, and they need to be matched to the initial dataset called sample, as defined in Section 0.1. R has a package named RSQLite that allows the user to handle databases, and it will be used to retrieve the information from the locally stored database attributes.db.

# connect to the database
conn <- dbConnect(SQLite(), './01_RawDatasets/attributes.db')

# the Attributes are saved in a table called ATTR

# Create and send a query to obtain the attributes from the table ATTR
query <- 'SELECT attributes FROM ATTR'

queried <- dbGetQuery(conn, query)

# In order for the user to get an idea about how the data was stored, we will print the first row of the dataframe 'queried'

print(queried[1, 1])
## [1] "{\"1\": {\"url\": \"https://www.fotocasa.es/es/comprar/vivienda/sevilla-capital/calefaccion-parking-jardin-trastero-ascensor-piscina/165669218/d?from=list\", \"city\": \"sevilla-capital\", \"price\": \"99499\", \"discount\": 0, \"elementary\": [\"1 hab.\", \"1 baño\", \"40 m²\", \"3ª Planta\"], \"Tipo de inmueble\": \"Apartamento\", \"Planta\": \"3ª planta\", \"Parking\": \"Privado\", \"Ascensor\": \"Sí\", \"Consumo energía\": \"C999 kW h m² / año\", \"Emisiones\": \"G999 kg CO₂ m² / año\"}}"


The output above is a Python dictionary converted to JSON format. Its first key is a unique ID, i.e., a number, and the value associated with that ID is again a dictionary, which will be referred to as the inner dictionary. For all datapoints, the first (key, value)-pair of the inner dictionary is the url that was scraped; the second pair is the city where the housing is located. Next, the reader will find a key named price and another named discount. The fifth key-value pair is “elementary”, a list with the attributes circled with the numbers 1, 2, 3, 4 in Fig. 0.2. These are common to all units. For these attributes there was no appropriate text to be scraped; the only information about the key was a picture displayed above the value (such as a bed, a shower, and so on). Hence, the attribute values were saved to a list, and distinguishing between them will be part of the data processing. Further characteristics that were scraped are the ones assigned numbers 5-15 in Fig. 0.2. Notice that these characteristics are not uniform across apartments, so identifying the unique characteristics will also be part of the work at a later stage.

The Extra-Features (No. 16 in Fig. 0.2) were not collected, for they are very arbitrary and not housing-specific “characteristics”. For example, the ones just mentioned translate to: wardrobes, flooring, home appliances, and so on.


"""Data structure: Python Dictionary"""
{ID: {url: "some_url", 
      city: "some_city",
      elementary: "[attr1, attr2, …]",
      named_attribute1: value1,
      named_attribute2: value2,
      .
      .
      .
      }
}

The JSON data stored in the dataframe queried needs to be parsed into a format understood by R. This needs to be done piecewise, otherwise R returns an error. The data returned will be in (named) matrix form, where the row name is the unit ID as displayed in the Python dictionary above, the column names are the keys of the inner dictionary, and the entries are the values of the inner dictionary.


# Parse the JSON of the first unit in order to create the variable (dataframe)
# that will hold the remaining data.
data <- fromJSON(queried$attributes[1])
# Create a dataframe
data <- as.data.frame(do.call(rbind, data))
for (i in 2:nrow(queried)) {
  
  # get each row of the data frame queried at a time and Parse it to R-Format
  queried.df <- fromJSON(queried$attributes[i])
  
  # Transform it to a dataframe
  queried.df <- as.data.frame(do.call(rbind, queried.df[1]))
  
  # append it to the first dataframe
  data <- bind_rows(data, queried.df)
}

# disconnect from the the base
dbDisconnect(conn)
datatable(data, options = list(pageLength = 10))
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

The output obtained is a dataframe with lists inside it. This is a consequence of how the data was saved to a Python dictionary.

Next, a list is created with only the non-numerical parts of the entries in the column elementary. I.e., for the list ["1 hab.", "1 baño", "40 m²", "3ª Planta"], the code will return the list ["hab.", "baño", "m²", "ª Planta"]; this list will be used to distinguish between the unique components stored in that column.

list_elementary <- data[ , 'elementary']


# create a function to get the strings and apply it 'listwise'
text_extractor <- function(my_list){
  text <- character()
  
  for (single_list in my_list) {
    # substitute the numbers by an empty string
    string <- gsub('[0-9]', "", single_list)
    # trim the returned string, for the case there is any empty space
    string <- trimws(string)
    # add the string to the vector 'text'
    text <- c(text, string)
  }
  
  return(text)
}

text_vector <- unlist(lapply(list_elementary, text_extractor))

# identify the unique values:
print(unique(text_vector))
##  [1] "hab."              "baño"              "m²"               
##  [4] "ª Planta"          "habs."             "baños"            
##  [7] "Bajos"             "m² terreno"        "Principal"        
## [10] "Entresuelo"        "Superior a Planta" "Sótano"           
## [13] "Subsótano"

The output above translates to

  • hab(s).: room(s);
  • baño(s): bathroom(s);
  • m²: square meters;
  • m² terreno: square meters of land;
  • ª Planta: floor number (e.g., 3ª Planta = third floor);
  • Bajos: ground floor;
  • Entresuelo: mezzanine;
  • Principal: ground floor;
  • Superior a Planta: above the ground floor;
  • Sótano: basement;
  • Subsótano: sub-basement.
It follows from the output above that the elements stored in the column elementary belong either to the location of the housing—which will receive the name floor—or to the size of the housing—which will be called size—or to the size of the land—which will be called landsize—or to the number of rooms—called rooms—or, last but not least, to the number of bathrooms—which will be called bathrooms.

Now new columns need to be created in the dataframe in order to unlist the column elementary, and the values stored in that column need to be correctly assigned to the new created columns.


keywords_floor <- c('Planta', 'Bajos', 'Principal', 'Entresuelo', 'Superior a Planta', 'Sótano', 'Subsótano')

keywords_rooms <- c('hab.', 'habs.')

keywords_bathrooms <- c('baño', 'baños')
# Create an empty dataframe with one row per entry of the list, filled with NA, and with columns [room, bathroom, size_sqm, landsize_sqm, floor]

mapping_df <- data.frame(matrix(NA, nrow = length(list_elementary), ncol = 5))

colnames(mapping_df) <- c('room', 'bathroom', 'size_sqm', 'landsize_sqm', 'floor')

# Iterate over the list_elementary 
for (i in seq_along(list_elementary)) {
  
  # and iterate over the elements of the list
  for (j in seq_along(list_elementary[[i]])) {
    
    # save the element in a variable for comparison
    element <- list_elementary[[i]][j]
    
    if (grepl('hab', element)) {
      
      mapping_df$room[i] <- element
      
    } else if (grepl('baño', element)) {
      
      mapping_df$bathroom[i] <- element
    
      } else if (grepl('m²', element) & !grepl('terreno', element)) {
      
      mapping_df$size_sqm[i] <- element
      
      } else if (grepl('m²', element) & grepl('terreno', element)) {
      
        mapping_df$landsize_sqm[i] <- element
      } else if (any(grepl(paste(keywords_floor, collapse = '|'), 
                           element, ignore.case = TRUE))) {
        mapping_df$floor[i] <- element
    }
    
  }
}

Next the elementary column is dropped from the dataset as it is no longer needed, and the lists stored inside the dataset will be unlisted. Notice how one can get access to the elements inside those lists:

data[1, 1][[1]] # this is how one obtain the element of the list
## [1] "https://www.fotocasa.es/es/comprar/vivienda/sevilla-capital/calefaccion-parking-jardin-trastero-ascensor-piscina/165669218/d?from=list"


# drop the elementary column
data <- data[ , !colnames(data) %in% 'elementary']

# Create a dataframe to store the unlisted values
unlisted_data <- data.frame(matrix(nrow = dim(data)[1], ncol = dim(data)[2]))
colnames(unlisted_data) <- colnames(data)

error <- list(NA, NA)
# Now, iterate over the columns of the data dataframe and save the unlisted values into this new 'unlisted_data' dataframe
for (i in 1:ncol(data)) {
    
  for (j in 1:nrow(data)) {
    tryCatch({
      unlisted_data[j, i] = data[j, i][[1]][[1]]
    }, error = function(e) {
 
    })
  }

}

# create an ID column for the dataframe 

unlisted_data$ID <- seq(nrow(unlisted_data))


# create an ID column for the dataframe mapping_df

mapping_df$ID <- seq(nrow(mapping_df))

# We join both dataframes
data <- full_join(unlisted_data, mapping_df, by = 'ID')

Next, a copy of the dataset data, named data_copy1, is created, in case any operation on the original dataset returns an inappropriate value.


data_copy1 <- data



data <- data_copy1

# drop the url column, as it is no longer needed
data <- data[, -c(1)]

The dataset data and the dataset sample need to be joined, so that the cities in the dataset data are assigned their autonomous regions and provinces. The method reads:

  1. Create a new dataset called cities to store the ‘ID’ and ‘city’ columns of the dataset data.
  2. Create a new dataset called municipalities to store the columns ‘ID’ and ‘Municipality’ of the dataset sample.
  3. The entries in the columns city and Municipality are made comparable by removing spaces and commas between words and by replacing accented letters such as á, ó, and ñ with their Latin counterparts a, o, n, saving the result in a new column. In the dataframe cities this new column is called ASCII_Cities; in the dataframe municipalities it is called ASCII_Municipality.
  4. Compute the string distance between entry i in the column ASCII_Cities and the entries stored in the column ASCII_Municipality of the municipalities dataframe. Obtain the index of the city with minimum distance using the which.min function.
  5. Use the index obtained above to query the corresponding ID in the municipalities dataframe and add this ID to a new column named matched_ID in the cities dataframe.
  6. Use the matched_ID column to unite the dataframes cities and sample into a dataframe named joined_frames. Finally, use the ID column to unite the datasets data and joined_frames.


# First, create a subset of dataset `data` with the names of the cities and their IDs

cities <- data[ , c('ID', 'city')]

# And a subset of the dataset `sample` with the name of the cities and their respective IDs
municipalities <- sample[ , c('ID', 'Municipality')]
# First of all, remove words like 'capital' and hyphens from this dataframe, as these are not found in the names of the cities stored in the dataset municipalities.

# Create a function to transform the words to Latin-Format
word_formatter <- function(text) {
  
  # remove the accents 
  no_accent <- stri_trans_general(text, 'Latin-ASCII')
  # Remove the word 'capital' (relevant for the cities-dataset)
  no_capital <- gsub('capital', "", no_accent)
  # remove any hyphen from the word (relevant for the cities-dataset)
  no_hyphen <- gsub('-', " ", no_capital)
  # remove any space between the words
  no_space <- gsub("\\s+", "", no_hyphen)
  # write the word in lower cases
  lower_case <- tolower(no_space)
  # Remove any ','
  no_comma <- gsub(',', "", lower_case)
  
  return(no_comma)
}

#  save this transformation to a new column (ASCII_Municipality) in the dataset municipalities
municipalities$ASCII_Municipality <- sapply(municipalities$Municipality, word_formatter)

#  apply the same function to the dataset cities
cities$ASCII_Cities <- sapply(cities$city, word_formatter)

The cities in both datasets are matched based on the smallest string-distance between the datapoints.


# We first create a new column in the cities dataframe to hold the matched ID's
cities$matched_ID <- NA

# We now iterate over each row in the dataframe cities and match the names
# in the column ASCII_Cities with the names in the column ASCII_Municipality in the dataframe municipalities and store the municipality-ID
for (i in 1:nrow(cities)) {
  distances <- stringdist(cities$ASCII_Cities[i],
                          municipalities$ASCII_Municipality)
  
  # obtain the minimum distance index
  min_distance <- which.min(distances)
  
  # query the index in the data set municipalities and obtain ID
  matched_ID <- municipalities[min_distance, 'ID']
  
  # Store the ID into the dataframe cities
  cities$matched_ID[i] <- matched_ID

}

#view(cities)

Now use this new column matched_ID to join the cities and the sample dataframe.


# copy the sample dataframe
sample_copy <- sample

colnames(sample_copy)[1] <- 'matched_ID'

joined_frames <- left_join(cities, sample_copy, by = 'matched_ID')

In the next snippet we return only non-repeated values and check that each city has been correctly assigned.


view_joined_frames_unique <- joined_frames[!duplicated(joined_frames$city), ]

#view(view_joined_frames_unique)

Notice that the columns [city, ASCII_Cities, matched_ID] of joined_frames are no longer needed. Thus, before joining the datasets, those columns will be dropped.


# Drop the columns mentioned from the dataframe joined_frames
to_be_dropped <- c('city', 'ASCII_Cities', 'matched_ID')

joined_frames <- joined_frames[ , -which(names(joined_frames) %in% to_be_dropped)]

# and join this dataframe to the dataframe 'data' based on the column ID 

data <- left_join(data, joined_frames, by = 'ID')
#view(data)
length(unique(data$Municipality))
## [1] 172
nrow(data)
## [1] 13015


1.2. Add the Dataset about the Mean-Salary Per Person in Spain


Notice that in the dataset that will be imported, cities like La Codesera are stored as Codesera, La. This pattern is repeated across all values in the dataset. First of all, the order of the words will be reversed. This will help with the computation of the min-string-distance between the cities in this dataset, which will be called salary, and the dataset municipalities, which, as is known by now, stores the cities of the column Municipality in the dataset sample. The salary dataset contains the mean salary per city retrieved from the Instituto Nacional de Estadística (INE), and it was also obtained with the Python-based scraping method.


salary <- read_csv('./01_RawDatasets/03_Mean_Salary.csv')
## Rows: 7703 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Municipality, Provincia, Renta
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Let us first have a look at the salary dataset.
datatable(salary, options = list(pageLength = 10))


#  create a function to rearrange the words: 'Albuera, La' will become 'La Albuera'
corrected_cities <- function(element) {
  splits = strsplit(element, ", ")[[1]]
  if (length(splits) == 2){
    paste(splits[2], splits[1], sep = " ")
  } else {
    splits[1]
  }
}

salary$Municipality <- sapply(salary$Municipality, corrected_cities)


One needs to be a little more careful with the dataset salary, as it contains many more cities than the sample or municipalities datasets do. Hence, the iteration and comparison over cities must occur in the dataset with the smallest number of cities, because stringdist() will always return a value; if the largest dataset were used, cities that are not found in the smallest dataset would be assigned whatever entry of the smaller dataframe happens to be most similar.
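This pitfall can be demonstrated with a toy example, using base R's adist() (Levenshtein distance) in place of stringdist(); the city names below are made up:

```r
# which.min() always yields *some* nearest neighbour, even when the
# queried city has no counterpart at all -- toy example with base R's
# adist() (Levenshtein distance) instead of stringdist()
small_set <- c("madrid", "sevilla")   # hypothetical small lookup set
query     <- "granada"                # city absent from small_set

distances <- adist(query, small_set)
nearest   <- small_set[which.min(distances)]

# 'granada' is matched to one of the two cities regardless, which is why
# the loop must run over the smaller dataset, where every city does have
# a true counterpart in the larger one
print(nearest)
```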

In the present case, the iteration and comparison will therefore occur in the dataset municipalities, and the comparison is made between city i in the dataset municipalities and all cities in the dataset salary. Thus, first, a unique ID for each unique municipality in the dataset salary needs to be created. These IDs are the ones that will be passed into a new column of the dataset municipalities and used to join both datasets.


salary$Municipality <- as.factor(salary$Municipality)

salary$ID <- as.numeric(salary$Municipality)

# exclude any duplicate from the dataset
salary <- salary[!duplicated(salary$ID), ]

#  create a column which is the union between the city and the province
salary$MunProv <- paste(salary$Municipality, salary$Provincia, sep = " ")

municipalities <- sample[ , c(1, 2, 4)]

#  do the same for the dataframe municipality
municipalities$MunProv <- paste(municipalities$Municipality, municipalities$Province, sep = " ")

Next, iterate over the dataset municipalities and compute the min-string-distance between its cities-province combination and the cities-province stored in the dataset salary.


municipalities$matched_ID <- NA

for (i in 1:nrow(municipalities)) {
  distance = stringdist(stri_trans_general(municipalities$MunProv[i], 'Latin-ASCII'), stri_trans_general(salary$MunProv, 'Latin-ASCII'), method = 'jw')
  
  min_distance = which.min(distance)
  
  municipalities$matched_ID[i] <- salary$ID[min_distance]
  
}

# Rename the column of the salary dataframe from ID to matched_ID in order to join both dataframes
colnames(salary)[4] <- 'matched_ID' 

joined_frames <- left_join(municipalities, salary, by = 'matched_ID')


# we now do visual inspection to assure correctness to the assignments
#view(joined_frames)

Here are the cities which have been incorrectly assigned:

The first column of the table (ID) contains the unique IDs of the cities in the dataset municipalities. This is the same ID the cities received in the dataset sample. The second column contains the names of the cities in the dataframe municipalities; the third column shows the names that were assigned to them by computing the minimum distance among all cities in the dataframe salary. The fourth column contains the ID from the salary dataframe that should have been assigned to each city in the dataframe municipalities if the assignment were correct.
| ID | municipalities-dataframe | salary-dataframe | Correct salary-ID |
|---|---|---|---|
| 3 | Valencia | Palencia | 6897 |
| 11 | Alicante | Albatana | 137 |
| 413 | San Quirico de Tarrasa (San Quirze del Vallés) | Sant Quirze de Besora | 5982 |
| 51 | Parla | Malagón | 4970 |
| 15 | Gijón | San Martín de Oscos | 2959 |
| 184 | Rentería (Errenteria) | Urnieta | 2517 |
| 218 | Villajoyosa | Villatoya | 3660 |
| 972 | Cobeña | Cardeña | 2046 |
| 452 | La Eliana | Lantadilla | 3400 |
| 18 | La Coruña | La Carlota | 6 |
| 21 | Tarassa (Terrassa) | La Garriga | 6444 |
| 585 | Cambadons | Tapia de Casariego | 1465 |
| 292 | San Antonio Abad (Sant Antoni de Portmany) | San Antonio de Benagéber | 5902 |
| 158 | Santurce | Santomera | 6158 |
| 352 | Ribarroja del Turia | Riola | 5520 |
| 33 | San Sebastián (Donostia) | San Sebastián de los Reyes | 2285 |
| 314 | Puebla de Vallbona (La Pobla de Vallbona) | Puebla de Valles | 3562 |
| 264 | Alacuás (Alaquàs) | Arconada | 154 |
| 128 | San Vicente del Raspeig (Sant Vicent del Raspeig) | San Vicente del Valle | 5992 |
| 86 | San Baudilio de Llobregat (Sant Boi de LLobregat) | Sant Hipòlit de Voltregà | 5905 |
| 430 | Benicasim (Beincàssim) | Benimassot | 1060 |
| 608 | Montroig | Montferri | 4453 |
| 222 | Petrel (Petrer) | Pedreguer | 5102 |
| 467 | Llodio (Laudio) | Okondo (Artziniega) | 3754 |
| 29 | Pamplona (Iruña) | Lana | 4940 |
| 192 | Vendrell | Vinebre | 2449 |
| 736 | Alcácer (Alcàsser) | Ayuela | 255 |
| 967 | Cee | Cea | 1896 |
| 145 | Villarreal (Vila-real) | Ullastrell | 7095 |
| 496 | Fuenterrabía (Hondarribia) | Olaberria | 3150 |
| 82 | Torrente | Cofrentes | NA |
| 398 | Alfaz del Pi | Alar del Rey | 3390 |
| 17 | Vitoria | Ziordia | 7567 |
| 333 | San Juan de Alicante (Sant Joan d’Alacant) | Beniardà | 5936 |
| 101 | Avilés | Belmonte de Miranda | 789 |
| 970 | Puentedeume | Fuentes de Carbajal | 2827 |
| 884 | Puebla de Farnals | Puebla de Yeltes | 5329 |
| 410 | Puzol | Portilla | 5228 |
| 275 | Jávea | Redován | 5473 |
| 587 | Llanes | Tapia de Casariego | 6393 |
The values above will be assigned manually:


# Note that these IDs are not the IDs that were wrongly assigned from the dataset salary to municipalities.
# They are the IDs in the dataset municipalities that identify the cities whose 'matched_ID' column received a wrong assignment.
wrong_assigned_ids <- c(3, 11, 413, 51, 15, 184, 218, 972, 452, 18, 21, 585, 292, 158, 352, 33,
                        314, 264, 128, 86, 430, 608, 222, 467, 29, 192, 736, 967, 145,
                        496, 82, 398, 17, 333, 101, 970, 884, 410, 275, 587)

# Correct ids: these are the IDs that should be assigned from the salary dataset to the matched_ID-column in the dataset municipalities. 
correct_ids <- c(6897, 137, 5982, 4970, 2959, 2517, 3660, 2046, 3400, 6, 6444, 1465, 5902, 6158,
                 5520, 2285, 3562, 154, 5992, 5905, 1060, 4453, 5102, 3754, 4940, 2449,
                 255, 1896, 7095, 3150, NA, 3390, 7567, 5936, 789, 2827, 5329, 5228,
                 5473, 6393)

#  iterate over the wrong_assigned IDs and assign the correct ID's to those values:

j = 0 
for (i in wrong_assigned_ids) {
  j = j + 1
  municipalities[municipalities$ID == i, 'matched_ID'] <- correct_ids[j]
}

# Now we join again the salary and municipalities dataframe based on the their matched IDs:
joined_frames <- left_join(municipalities, salary, by = 'matched_ID')

#view(joined_frames)

By visual inspection, the user can be assured that the cities have now been correctly matched. Next, the municipalities and salary dataframes have the matched_ID column in common, which will be used to append the column Renta from the salary dataframe to municipalities. Recall that the Municipality column in the municipalities dataset and in the data dataset are the same, and it is thus used to append a subset of municipalities (containing only the columns Renta and Municipality) to the data dataset.


# add the renta-column from the dataset salary to the dataset municipalities based on the matched_ID
salary_subset <- salary[, c(3, 4)]
municipalities <- left_join(municipalities, salary_subset, by = 'matched_ID')

# obtain only the columns from municipality that are relevant to us
municipalities_subset <- select(municipalities, 'Municipality', 'Renta')

# Append this subset based on the names of the cities `Municipality` column to our main dataset `data`
data <- left_join(data, municipalities_subset, by = 'Municipality')

data <- as_tibble(data)


The information about the mean-salary per city (renta media) was finally added to the main dataset: data. There are still other datasets that need to be joined to the data one. This will be done in the following sections.


1.3. Add dataset about the Surface Per City

1.3.1. Importing and Cleaning the data: Preliminaries

The next dataset to be imported was retrieved (scraped) from 15MPedia. Among the information this dataset contains is the surface area of several cities in Spain. We would like to have this information in order to compute the population density for the cities, which will be studied in a further section. The information imported from 04_Surface will be stored in a dataframe called density_df.

Nota Bene: The dataset also contains information about density, but it is based on the population for 2019. The main dataset data contains more up-to-date information, as its population refers to 2022. Given that the surface of a city is unlikely to have changed in those two years (unlike the population), the current density can be computed from the old surface values.
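As a minimal sketch (with toy values, not the real dataset), the intended computation is simply the 2022 population divided by the 2019 surface:

```r
# Toy illustration of the planned density computation:
# 2022 population divided by the (assumed unchanged) 2019 surface
toy <- data.frame(Municipality = c("A", "B"),
                  Population   = c(100000, 25000),  # inhabitants, 2022
                  Surface      = c(50, 20))         # km², 2019

toy$Density <- toy$Population / toy$Surface          # inhabitants per km²
print(toy$Density)
# [1] 2000 1250
```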


density_df <- read.csv('./01_RawDatasets/04_SurfacePerCity.csv', sep = ';')

# Notice we only want the first (X), the second (Municipio) and the 7th column (Superficie..km..) of this dataframe

density_df <- select(density_df, X, Municipio, Superficie..km..)

#  rename the columns in order to make it easier for us to access them:
colnames(density_df) <- c('ID', 'Municipality', 'Surface')

# have a look at the dataset
datatable(density_df, options = list(pageLength = 10))

Just as before, the city names will be matched using the min-string-distance approach. The dataset municipalities is again created from the sample dataframe, as defined in Section 0.1. In the dataframe municipalities a new column named matched_ID is created. Next, the code iterates over the cities and matches them to the ones stored in the density_df dataframe. density_df was imported with an ID column, whose values are passed to the matched entries in the matched_ID column of the municipalities dataframe, so that the dataframes can be joined on these IDs. At a later stage, the correctness of the assignments needs to be checked visually.


# create the municipalities dataframe
municipalities <- select(sample, ID, Municipality)

# create an empty column to store the matched IDs
municipalities$matched_ID <- NA

for (i in 1:nrow(municipalities)) {
  distances = stringdist(stri_trans_general(density_df$Municipality, 'Latin-ASCII'), 
                         stri_trans_general(municipalities$Municipality[i], 'Latin-ASCII'),
                         method = 'osa')
  
  
  min_distance <- which.min(distances)
  
  municipalities$matched_ID[i] <- density_df[min_distance, 'ID']
  
}

# Bring the cities from the density_df dataset into municipalities; first, rename the 'ID' column of density_df to be able to unite both
colnames(density_df)[1] <- 'matched_ID'

# unite both datasets
joined_frames <- left_join(municipalities, density_df, by = 'matched_ID')

# We now use joined_frames to visually inspect whether the cities have been correctly assigned.
#view(joined_frames)

Here are the cities that were wrongly assigned:

The first column of the table (ID) contains the unique IDs of the cities in the dataset municipalities. This is the same ID the cities received in the dataset sample. The second column contains the names of the cities in the dataframe municipalities; the third column shows the names that were assigned to them by computing the minimum distance among all cities in the dataframe density_df. The fourth column contains the ID from the density_df dataframe that should have been assigned to each city in the dataframe municipalities if the assignment were correct.
| ID | municipalities-dataframe | density_df-dataframe | Correct density-ID |
|---|---|---|---|
| 1158 | Puebla de la Calzada | Puebla de la Reina | NA |
| 16 | Hospitalet de LLobregat | El Prat de LLobregat | NA |
| 334 | Erandio | Grado | NA |
| 57 | Cádiz | Candín | NA |
| 413 | San Quirico de Tarrasa | San Tirso de Abres | NA |
| 112 | Villanueva y Geltrú (Vilanova i la Geltrú) | Villanueva del Rey | 4170 |
| 174 | Blanes | Llanes | NA |
| 184 | Rentería (Errenteria) | Méntrida | 4337 |
| 337 | Olesa de Montserrat | Ossa de Montiel | NA |
| 839 | San Fulgencio (Sant Fulgencio) | San Asension | NA |
| 710 | El Grove | El Gordo | NA |
| 972 | Cobeña | Cobeta | NA |
| 665 | Echévarri (Etxebarri) | Estépar | NA |
| 89 | Ceuta | Cella | NA |
| 8 | Palma (Palma de Mallorca) | Cala | 380 |
| 452 | La Eliana (L’Eliana) | La Olivaa | NA |
| 591 | Santa Cruz de Bezana | Santa Cruz de Mudela | NA |
| 124 | Granollers | Brañosera | NA |
| 309 | Molins de Rey | Olías del Rey | NA |
| 292 | San Antonio Abad (Sant Antoni de Portmany) | Santa Combra | 909 |
| 516 | Masamagrell (Massamagrell) | Benamaurel | NA |
| 370 | San Juan de Aznalfarache | San Juan de la Nava | NA |
| 850 | El Sauzal | El Saucejo | NA |
| 21 | Tarrasa (Terrasa) | Tàrrega | 1989 |
| 158 | Santurce (Santurtzi) | Santa Fe | 3772 |
| 537 | Benetúser (Benetússer) | Bonete | NA |
| 1182 | Anglés (Anglès) | Angüés | NA |
| 189 | Burjasot (Burjassot) | Barjas | NA |
| 97 | Guecho | Nueno | NA |
| 563 | Ogíjares | Mijares | NA |
| 314 | Puebla de Vallbona (La Pobla de Vallbona) | Puebla de Valles | 4244 |
| 264 | Alacuás (Alaquàs) | Alcaraz | NA |
| 351 | Zarauz (Zarautz) | Zarazoga | NA |
| 555 | Canet de Mar | Lloret de Mar | NA |
| 377 | Valle de Egüés (Egües, Eguesibar) | Valle de Mena | 2748 |
| 38 | Castellón de la Plana (Castelló de la Plana) | Castrejón de la Peña | 1091 |
| 263 | Durango | Aranga | NA |
| 41 | San Cristóbal de La Laguna (La Laguna) | San Cristóbal de Entreviñas | 1243 |
| 387 | Arroyo de la Encomienda | Baños de la Encina | NA |
| 81 | Melilla | Sevilla | NA |
| 86 | San Baudilio de Llobregat (Sant Boi de LLobregat) | San Martín de la Vega | NA |
| 554 | Palau-solità i Plegamans | Peralta de Calansaz | NA |
| 814 | Abanto y Ciérvana (Abano Zierbena) | La Cierva | NA |
| 304 | Paiporta | Pájara | NA |
| 461 | El Astillero (Real Astillero) | El Sotillo | NA |
| 467 | Llodio (Laudio) | LLíria | 3834 |
| 167 | Alcantarilla | Alcántara | NA |
| 111 | Torremolinos | Montemolín | NA |
| 736 | Alcácer (Alcàsser) | Alcocer | NA |
| 967 | Cee | Cea | 2531 |
| 426 | La Zubia | La Fueva | NA |
| 851 | Catral | Carral | NA |
| 154 | Mairena del Aljarafe | Mairena del Alcor | NA |
| 776 | San Antonio de Benagéber (Sant Antoni de Benaixeve) | Sant Antoni de Portmany | NA |
| 1095 | La Puebla de Alfindén | La Puebla de Valverde | NA |
| 214 | Azuqueca de Henares | Yunquera de Henares | NA |
| 496 | Fuenterrabía (Hondarribia) | Fuenferrada | 4748 |
| 425 | Archena | Marchena | NA |
| 231 | Barberá del Vallés | La Roca del Vallès | NA |
| 47 | Lérida (LLeida) | Mérida | 367 |
| 360 | Vilaseca (Vila-seca) | Valseca | NA |
| 22 | Badalona | Baraona | NA |
| 398 | Alfaz del Pi (L’Alfàs del Pi) | Alcalá del Río | NA |
| 998 | Moncófar (Moncofa) | Monóvar | NA |
| 155 | Figueras (Figueres) | Viguera | NA |
| 17 | Vitoria (Vitoria-Gasteiz) | Villoria | 219 |
| 333 | San Juan de Alicante (Sant Joan d’Alacant) | San Juan de Plan | NA |
| 510 | La Algaba | La haba | NA |
| 346 | Ibi | Ibias | 2304 |
| 114 | Castelldefels | Castelldans | NA |
| 422 | Humanes de Madrid | Linares de Mora | NA |
| 88 | Fuengirola | Quiroga | NA |
| 999 | Andorra | Añora | 759 |
| 1032 | Canet de Berenguer (Canet d’en Berenguer) | Canal de Berdún | NA |
| 867 | Reinosa | Reina | NA |
242 San Pedro de Ribas (Sant Pere de Ribes) San Pedro de Rozados 3545
146 Mollet del Vallés (Mollet del Vallès) Cortes de Pallás NA
237 Lejona (Leioa) Lena NA
486 Castilleja de la Cuesta Castillejo de Iniesta NA
884 Puebla de Farnals (La Pobla de Farnals) Puebla de Valles NA
410 Puzol (Puçol) Ourol NA
624 Vilanova del Camí Vilanova de Sau NA
953 Viladecaballs (Viladecavalls) Valdecaballeros NA
62 Gerona Gerana NA
76 Cornellá de Llobregat El Prat de Llobregat NA
375 Moncada Monda NA
69 Las Rozas de Madrid Las Rozas de Valdearroyol 2485

Unlike the results obtained in Section 1.2, many of the cities in the dataset municipalities could not be found in the dataset density_df. To solve this problem, the following is done:

  1. First, manually assign the correct IDs to those cities for which the min-string-distance method failed but for which the correct assignment could be found in the dataset density_df by visual inspection.
  2. Write a little webscraper to retrieve the information about the surface of the cities that does not exist in the dataframe density_df (more about this later in Section 1.3.2).

- Correct the assignments manually:


wrongly_assigned_ids <- c(112, 184, 8, 292, 21, 158, 314, 377, 38, 41, 467,
                         967, 496, 47, 17, 346, 999, 242, 69)

correct_ids <- c(4170, 4337, 380, 909, 1989, 3772, 4244, 2748, 1091, 1243, 3834,
                 2531, 4748, 367, 219, 2304, 759, 3545, 2485)

# Now we iterate over the wrongly assigned IDs and assign the correct ID to the column `matched_ID` in the municipalities dataframe
j = 0
for (i in wrongly_assigned_ids) {
  j = 1 + j
  municipalities[municipalities$ID == i, 'matched_ID'] = correct_ids[j]
}
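For reference, the same correction can be written without an explicit loop; a minimal sketch, assuming the two vectors above are aligned element by element:

```r
# match() returns, for each wrongly assigned ID, its row position in
# municipalities$ID; the correct IDs can then be assigned in one step.
rows <- match(wrongly_assigned_ids, municipalities$ID)
municipalities$matched_ID[rows] <- correct_ids
```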


1.3.2. A small Webscraper: Finding non-available data

Wikipedia normally has detailed information on cities. Notice from Figure 1.1 that the URLs that lead users to those pages are identical except for the city name at the very end. We will use this feature to retrieve information about the cities (their surface) that could not be found in the dataset 04_SurfacePerCity.csv (or density_df).

First, all the (NA)-cities as well as their respective IDs (from the municipalities dataframe) were stored in a csv file named 04_NASurface.csv.

Fig.1.1 - Querying Wikipedia: The figure shows a screenshot from Wikipedia and the associated URL of the page. The reader should notice that the URLs of Olesa de Montserrat and Puebla de la Calzada differ only in the city name.


  1. First import the dataset.


na_cities <- read_csv('./01_RawDatasets/04_NASurface.csv')
## Rows: 68 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): NA_municipality
## dbl (1): ID
## lgl (1): status
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
datatable(na_cities, options = list(pageLength = 10))

The names in parentheses in the table above are not needed. They are alternative spellings of the city names (depending on whether the name is Catalan, Basque, Valencian, and so on). Hence, in a first step, the names in parentheses will be deleted; next, another column is created in which multi-word city names are joined by '_' (underscore).


na_cities <- mutate(na_cities,
                    single_names = gsub(" \\(.*\\)", '', NA_municipality),
                    united_names = gsub(' ', '_', single_names))
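To illustrate the two cleaning steps, here is a minimal sketch with one entry taken as an example:

```r
# entry with a parenthesised alternative spelling
name <- "San Baudilio de Llobregat (Sant Boi de LLobregat)"
single <- gsub(" \\(.*\\)", "", name)  # drop the parenthesised variant
united <- gsub(" ", "_", single)       # join multi-word names with '_'
united  # "San_Baudilio_de_Llobregat", ready to be appended to a URL
```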

Only two columns of this dataframe are needed for the webscraping: the ID, so the cities can be matched at a later stage, and united_names.


# keep only the cited columns
na_cities <- select(na_cities, ID, united_names)

Now the process of webscraping can start:


base_url <- "https://es.wikipedia.org/wiki/Puebla_de_la_Calzada"
#base_url <- "https://es.wikipedia.org/wiki/San_Fulgencio"

content <- read_html(base_url)

# get table with information about the surface
table <- html_nodes(content, ".infobox.geography.vcard")


table_contents <- as.data.frame(html_table(table, fill = TRUE))

#view(table_contents)
datatable(table_contents, options = list(pageLength = 10))

If one inspects the table above, it can be seen that one of the rows in the first column contains the word superficie. This is exactly the information needed. Thus, this process will be repeated for each city in the dataframe na_cities: if the word superficie is found in the first column, the other columns will be checked for the superficie value (given in km²), and these values will be stored in a new column of the dataframe na_cities named superficie.


na_cities$superficie <- NA
for (i in na_cities$united_names) {
  tryCatch({
    base_url <- paste0('https://es.wikipedia.org/wiki/', i)
      # query url and search for the table containing the information about the surface
    content <- read_html(base_url)
    table <- html_nodes(content, ".infobox.geography.vcard")
    table_contents <- as.data.frame(html_table(table, fill = TRUE))
    # We now use a regular expression to check whether the word 'superficie' can be found in the first column of the dataframe
    
    result <- sum(grepl('superficie', table_contents[ , 1], ignore.case = TRUE))
    index <- which.max(grepl('superficie', table_contents[ , 1], ignore.case = TRUE))
    # If the regular expression returns TRUE for any row, the sum will be one or more. If it is more than one, we cannot know for sure which row holds the information we are looking for. If it is exactly one, a unique row contains the information about the superficie. If it is zero, the request did not work correctly.
    if (result == 1) {
      
      # We now iterate over the columns of the dataframe to make sure we obtain as value a word containing 'km' in it
      for (j in 1:ncol(table_contents)){
        
        if (grepl('km', table_contents[index, j], ignore.case = TRUE) == TRUE) {
          na_cities[na_cities$united_names == i, 'superficie'] <-  table_contents[index, j]
          break
        }
      }
    }
  }, error = function(e) {
    print('Url-not uniquely identifiable')
    
  })
}
## [1] "Url-not uniquely identifiable"
## [1] "Url-not uniquely identifiable"
## [1] "Url-not uniquely identifiable"
datatable(na_cities, options = list(pageLength = 10))

Here are the cities for which a uniquely identifiable URL could not be found:


na_cities[is.na(na_cities$superficie),]
## # A tibble: 6 × 3
##      ID united_names  superficie
##   <dbl> <chr>         <chr>     
## 1   839 San_Fulgencio <NA>      
## 2   972 Cobeña        <NA>      
## 3    89 Ceuta         <NA>      
## 4   263 Durango       <NA>      
## 5    81 Melilla       <NA>      
## 6   167 Alcantarilla  <NA>

The missing information is now searched manually for these 6 cities:

City Superficie
San Fulgencio 19,75 km²
Cobeña 20,8 km²
Ceuta 18,5 km²
Durango 10,79 km²
Melilla 12,3 km²
Alcantarilla 16,24 km²


ids <- c(839, 972, 89, 263, 81, 167)
superficies <- c('19,75 km²', '20,8 km²', '18,5 km²', '10,79 km²', '12,3 km²', '16,24 km²')

# We now iterate over the ids and assign the surface to those cities
j = 0
for (i in ids) {
  j = j + 1
  na_cities[na_cities$ID == i, 'superficie'] = superficies[j]
}

datatable(na_cities, options = list(pageLength = 10))

Now that all the required information is available, the datasets municipalities and density_df still need to be joined. Recall that for some wrongly assigned values in the column matched_ID a correction has already been made.

The cities that could not be found in the dataset density_df have not yet been corrected, but they will be in a moment. First, join the datasets (including the incorrectly assigned values):


joined_frames <- left_join(municipalities, density_df, by = 'matched_ID')

# Now we need to correct the surface for those cities that are stored in the `na_cities` dataframe. That dataset and the municipalities one share the same ID column. We will use this to correct the values in the Surface column of the joined dataframe

for (i in na_cities$ID){
  joined_frames[joined_frames$ID == i, 'Surface'] = na_cities[na_cities$ID == i, 'superficie']
}

# Note that we only need to keep the columns 'Surface' and 'Municipality.x' from the joined frame

joined_frames <- select(joined_frames, Surface, Municipality.x)

colnames(joined_frames)[2] <- 'Municipality'

The names of the municipalities in the dataframe joined_frames are exactly the same as those stored in the dataset data. So the column Municipality will be used to join both dataframes.


data <- left_join(data, joined_frames, by = 'Municipality')

# create a copy of the data, in case an operation returns inappropriate values
data_copy2 <- data
# this is the second time we make a (backup) copy
data <- data_copy2

Finally the surface per city has been added to the main dataset data.

1.4. Add Dataset about Criminality and Unemployment Rate


There are still two further datasets that shall be part of the analysis: one about criminality and another about the unemployment rate, both at the province level.

  • Let the focus first lie on the dataset about unemployment:
First, import the dataset and match its provinces with the provinces stored in the sample dataframe using the min string distance. For this, a new dataframe called provinces will be created, which stores the same provinces as those found in the sample dataframe. The unemployment-rate dataset will first be stored in a dataframe called unemployment_df. The reader can get a glimpse of the dataset by inspecting the output below:


# Import the dataset about the total rate of unemployment per province
unemployment_df <- read.csv('./01_RawDatasets/05_TasaParo.csv', sep = ';', encoding = 'latin1')
datatable(unemployment_df, options = list(pageLength = 10))

Next, some formatting will be applied to the dataset above, such as excluding the numbers from the names and making double-named provinces more comparable, in order to increase the accuracy of the min-string-distance computation.

# Exclude the numbers from the names
unemployment_df$Provincias <- gsub("\\d+ ","", unemployment_df$Provincias)

# Note that some entries, like Alicante, have two names (Alicante/Alacant). We only want the first one
unemployment_df$Provincias <- gsub('/.*', "", unemployment_df$Provincias)

# Create an ID column in the unemployment dataset
unemployment_df$ID <- as.numeric(as.factor(unemployment_df$Provincias))

# Change the name of the province Araba (stored in the unemployment dataset) to Álava, and Lleida to Lérida; these are alternative names for the same provinces. This will help with the matching, as these are the names stored in the sample/provinces dataframe
unemployment_df[unemployment_df$Provincias == 'Araba', 'Provincias'] = 'Álava'
unemployment_df[unemployment_df$Provincias == 'Lleida', 'Provincias'] = 'Lérida'
# Create a dataset with the provinces from the sample dataset
provinces <- as.data.frame(unique(sample$Province))
colnames(provinces) <- 'provinces'

# Create an ID column to store the matched ids in the dataset provinces
provinces$ID <- NA
# We now iterate over the dataset provinces and compute the min-distance among all the provinces in the dataset unemployment_df

for (i in 1:nrow(provinces)) {
  distances <- stringdist(unemployment_df$Provincias, provinces$provinces[i], method = 'lcs')
  
  min_distance <- which.min(distances)
  provinces$ID[i] <- unemployment_df$ID[min_distance]
}

# unite both dataframes now based on the ID column
joined_frames <- left_join(provinces, unemployment_df, by = 'ID')

#view(joined_frames)
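As a side note on the 'lcs' metric used above: it counts the characters that do not belong to the longest common subsequence of the two strings, so small spelling variants yield small distances. A minimal sketch:

```r
library(stringdist)
# 'Lérida' (Spanish) vs 'Lleida' (Catalan): the longest common
# subsequence is "Lida" (4 characters), so the distance is
# 6 + 6 - 2*4 = 4, much smaller than against unrelated names.
stringdist("Lérida", "Lleida", method = "lcs")
```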

After checking the correctness of the assignments, the ID column is used to obtain the Total (unemployment rate) column. Next, this column is added to the provinces dataset.


unemployment_df <- select(unemployment_df, ID, Total)
provinces <- left_join(provinces, unemployment_df, by = 'ID')
colnames(provinces)[c(1, 3)] <- c('Province', 'UnempRate')
provinces <- provinces[, -2]

Notice that the provinces dataframe and the data dataframe use the same names for the provinces, which allows joining both datasets on the province names (recall that by now the dataset provinces already contains the information about the unemployment rate).


data <- left_join(data, provinces, by = 'Province')

  • Finally, the focus lies on adding the last dataset, the one about criminality per province.

The output below shows how the dataset 06_Criminality is formatted. It will be stored in a variable (dataframe) called criminality_df.


# First import the dataset
criminality_df <- read.csv('./01_RawDataSets/06_Criminality.csv', skip = 5, encoding = 'latin1')
datatable(criminality_df, options = list(pageLength = 10))

Notice that the dataframe is organized as follows:

  • In one row there is the name of the observation unit
  • In the next row there is the description “III. TOTAL INFRACCIONES PENALES” (total criminal offences) in the first column and, in the second column, its associated value.

The second column will be shifted by -1 so that the names of the observations and the criminality values end up in the same row; then the rows containing NA values (or empty strings) in the second column are dropped.


criminality_df %>% 
  mutate(
    shifted_column = c(Enero.diciembre.2022[-1], NA),
    ) %>% # shift the second column by -1 unit and store it in a column named shifted_column
      select(-Enero.diciembre.2022, -X.1) %>%  # drop the columns we do not need anymore
        filter(grepl("^\\d", shifted_column) == TRUE) -> criminality_df  # keep only rows whose shifted column starts with a digit
          
          
# Now change the name of the columns 
colnames(criminality_df) <- c('province', 'total_criminality')

# Drop the rows with national total and the foreign total
criminality_df <- filter(criminality_df, !(province == 'Total Nacional' | province == 'En el extranjero'))
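The shift-by-one trick can be seen on a toy example (the names and numbers below are made up for illustration):

```r
# names and values alternate row-wise, as in the raw criminality file
x <- c("Madrid", "389718", "Barcelona", "331489")
shifted <- c(x[-1], NA)          # move every value up by one row
df <- data.frame(name = x, value = shifted)
df[grepl("^\\d", df$value), ]    # keep only the (name, value) rows
```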

Next, a little webscraping is done in order to obtain the population of the provinces. The information is stored in the criminality dataset; it will be needed in the data-analysis section. The table can be found on Wikipedia and is illustrated in Fig. 1.2. Click to be directed to the table.


Fig.1.2 - Querying Wikipedia (2): The figure shows a screenshot from Wikipedia with the table containing the population per province in Spain.


base_url <- "https://es.wikipedia.org/wiki/Anexo:Provincias_y_ciudades_autónomas_de_España"


content <- read_html(base_url)

# get the table with information about the population per province
table <- html_nodes(content, 'table')[[1]]


as.data.frame(html_table(table, fill = TRUE)) %>% 
  select(Puesto, Nombre, Población) %>% 
    mutate(
      prov_pop = gsub("\\..*", "", Población),
      prov_pop = gsub("&.*?0", "", prov_pop)
    ) %>% 
      filter(!Puesto == 'Total') %>%  # drop the row with the total
        select(-Población) -> provpop_df # and the column Población, as we have created another column prov_pop with cleaned data

Now join the datasets provpop_df and criminality_df. As the only column they have in common is the province name, one needs to match them once again using the stringdist() function.


# We create an ID column in the dataframe criminality_df to store the matched IDs of the provinces
criminality_df$ID <- NA

# now, for each province in criminality_df, we apply the stringdist function
for (i in 1:nrow(criminality_df)) {

  distances <- stringdist(provpop_df$Nombre, criminality_df$province[i], method = 'lcs') 
  min_distance <- which.min(distances)

  criminality_df$ID[i] <- provpop_df[min_distance, 'Puesto']
}

# we now join both dataframes
colnames(provpop_df)[1] <- 'ID'

joined_frames <- left_join(criminality_df, provpop_df, by = 'ID')

# Inspect the dataframe for correctness
#view(joined_frames)
# Assignment was correct for all cities
criminality_df <- left_join(criminality_df, provpop_df, by = 'ID') 

criminality_df <- select(criminality_df, -Nombre)  # keep all the columns but Nombre; otherwise we would have the names of the provinces twice.

# create column with criminality per 100,000
criminality_df <- mutate(criminality_df,
                         crime_rate =((as.numeric(total_criminality) / as.numeric(prov_pop))) * 1e5)

The assignment was correct for all provinces. Next, the dataset criminality_df will be appended to the provinces dataframe, which contains the unique provinces as given in the dataset data. Again the datasets are joined using the minimum-string-distance method.


provinces <- as.data.frame(unique(data$Province)) 

colnames(provinces) <- 'provinces'
# Create column to store id
provinces$ID <- NA

Let’s first inspect the province names in both dataframes:


unique(sort(criminality_df$province))
##  [1] "Albacete"               "Alicante/Alacant"       "Almería"               
##  [4] "Araba/Álava"            "Asturias"               "Ávila"                 
##  [7] "Badajoz"                "Balears (Illes)"        "Barcelona"             
## [10] "Bizkaia"                "Burgos"                 "Cáceres"               
## [13] "Cádiz"                  "Cantabria"              "Castellón/Castelló"    
## [16] "Ceuta"                  "Ciudad Real"            "Córdoba"               
## [19] "Coruña (A)"             "Cuenca"                 "Gipuzkoa"              
## [22] "Girona"                 "Granada"                "Guadalajara"           
## [25] "Huelva"                 "Huesca"                 "Jaén"                  
## [28] "León"                   "Lleida"                 "Lugo"                  
## [31] "Madrid"                 "Málaga"                 "Melilla"               
## [34] "Murcia"                 "Navarra"                "Ourense"               
## [37] "Palencia"               "Palmas (Las)"           "Pontevedra"            
## [40] "Rioja (La)"             "Salamanca"              "Santa Cruz de Tenerife"
## [43] "Segovia"                "Sevilla"                "Soria"                 
## [46] "Tarragona"              "Teruel"                 "Toledo"                
## [49] "Valencia/València"      "Valladolid"             "Zamora"                
## [52] "Zaragoza"


unique(sort(provinces$provinces))
##  [1] "Álava"                  "Albacete"               "Alicante"              
##  [4] "Almería"                "Badajoz"                "Barcelona"             
##  [7] "Burgos"                 "Cáceres"                "Cádiz"                 
## [10] "Cantabria"              "Castellón"              "Ciudad Real"           
## [13] "Córdoba"                "Cuenca"                 "Granada"               
## [16] "Guadalajara"            "Huelva"                 "Huesca"                
## [19] "Jaén"                   "Las Palmas"             "León"                  
## [22] "Lugo"                   "Madrid"                 "Málaga"                
## [25] "Melilla"                "Navarra"                "Palencia"              
## [28] "Pontevedra"             "Principado de Asturias" "Salamanca"             
## [31] "Santa Cruz de Tenerife" "Segovia"                "Sevilla"               
## [34] "Tarragona"              "Teruel"                 "Toledo"                
## [37] "Valencia"               "Valladolid"             "Zamora"                
## [40] "Zaragoza"

Notice that in the second dataset each province has a single name (unlike the first one, where entries such as Alicante/Alacant appear). The names are made comparable by changing these entries in the dataset criminality_df.


to_be_renamed = c('Araba/Álava', 'Alicante/Alacant', 'Castellón/Castelló', 'Valencia/València', 'Palmas (Las)')
substitute_by = c('Álava', 'Alicante', 'Castellón', 'Valencia', 'Las Palmas')
for (i in 1:length(to_be_renamed)) {
  criminality_df[criminality_df$province == to_be_renamed[i], 'province'] = substitute_by[i]
}
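The renaming loop can equivalently be expressed with a named lookup vector; a minimal sketch:

```r
# named lookup vector: old name -> new name
lookup <- setNames(substitute_by, to_be_renamed)
hits <- criminality_df$province %in% names(lookup)
criminality_df$province[hits] <- lookup[criminality_df$province[hits]]
```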

Now compute the min-string-distance:


for (i in provinces$provinces){
  distances <- stringdist(criminality_df$province, i, method = 'lcs')
  min_distance <- which.min(distances)
  
  provinces[provinces$provinces == i, 'ID'] = criminality_df[min_distance, 'ID']
}

# now join the dataframes to inspect for correctness of the assignment
joined_frames <- left_join(provinces, criminality_df, by = 'ID')

#view(joined_frames)

All values have been correctly assigned.


provinces <- left_join(provinces, criminality_df, by = 'ID')
provinces <- select(provinces, -province)
# Now we join the provinces dataset to the data dataset
colnames(provinces)[1] <- 'Province'
data <- left_join(data, provinces, by = 'Province')

A copy of the dataframe is created for security reasons.

# this is the third backup copy
data_copy3 <- data
data <- data_copy3

Finally the dataset is complete! In the next step the focus will be on cleaning the dataset data and doing some exploratory analysis.


2. Data Analysis

2.1. Data Transformation and Cleaning


In this section the datapoints stored in the dataset data will be cleaned and transformed such that they can be used for numerical analysis. This includes dropping columns that will not be used anymore, creating new variables, and creating new factors.

Recall how the dataset looks:


datatable(data, options = list(pageLength = 10))
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

These are the columns that will be used:


c('price', 'Tipo de inmueble', 'floor', 'Parking', 'Ascensor', 'Consumo energía', 'Emisiones', 'Antigüedad', 'Estado', 'room', 'bathroom', 'size_sqm', 'Population', 'Municipality', 'Province', 'Region', 'Renta', 'Surface', 'UnempRate', 'crime_rate')

Their names will be changed to:

c('price', 'type', 'floor', 'parking', 'lift', 'energy_con', 'pollution', 'age', 'condition', 'room', 'bathroom', 'size_sqm', 'pop', 'municipality', 'province', 'region', 'mean_salary', 'surface', 'unemp', 'crime_province')


# 1. First select only those columns that will be used in the analysis.
columns_to_be_used <- c('price', 'Tipo de inmueble', 'floor', 'Parking', 'Ascensor', 'Consumo energía', 'Emisiones', 'Antigüedad', 'Estado', 'room', 'bathroom', 'size_sqm', 'Population', 'Municipality', 'Province', 'Region', 'Renta', 'Surface', 'UnempRate', 'crime_rate')

new_names <- c('price', 'type', 'floor', 'parking', 'lift', 'energy_con', 'pollution', 'age', 'condition', 'room', 'bathroom', 'size_sqm', 'pop', 'municipality', 'province', 'region', 'mean_salary', 'surface', 'unemp', 'crime_province')

# we create another dataframe to work on the changed version of the dataframe data.
datav1 <- data[,columns_to_be_used]

# next, change the name of the columns.
colnames(datav1) <- new_names

Now notice that for the columns pollution and energy_con the values are classified from A to G, where A is the best and G the worst classification. Along with these classifications, there is also the estimated amount of energy consumption and pollution per year. They will be transformed into numerical values as follows:

  • The energy-efficiency letters receive a value \(i\), where \(\left(i \mid A\right) = 1, \left(i \mid B\right) = 2, \dots, \left(i \mid G\right) = 7\)
  • An energy-efficiency score is created as a weighted average of both columns. The score is computed as follows:

    \[ s_{i}=\frac{i_{energy}}{i_{energy}+j_{pollution}}\cdot\left(kWh/m²/año\right)+\frac{j_{pollution}}{i_{energy}+j_{pollution}}\cdot\left(kgCO₂/m²/año\right) \]

    The lower this score is, the more energy efficient the object is.

  • In a next step a distribution of the unique scores is created; based on this distribution, seven buckets (intervals) are formed and a new classification (score) is assigned to each object. Objects with low scores will still be the most energy efficient ones, but now the efficiency is computed relative to the other objects.
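A small worked example of the score (a hypothetical flat, not taken from the dataset): energy label C \((i=3)\), pollution label E \((j=5)\), 120 kWh/m²/año consumed and 30 kgCO₂/m²/año emitted:

```r
i <- 3; j <- 5                        # C and E converted to numbers
s <- i/(i + j) * 120 + j/(i + j) * 30
s  # 3/8 * 120 + 5/8 * 30 = 45 + 18.75 = 63.75
```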
energy_letters <- list('A' = 1, 'B' = 2, 'C' = 3, 'D' = 4, 'E' = 5, 'F' = 6, 'G' = 7)

# Let us first count how many entries contain no digits at all
no_dig_energy <- sum(!grepl("\\d", datav1$energy_con))
no_dig_pollution <- sum(!grepl("\\d", datav1$pollution))

paste('There are [energy consumption; pollution]=[', no_dig_energy, ";", no_dig_pollution, "] datapoints that have no digits in it.", sep = "")
## [1] "There are [energy consumption; pollution]=[165;165] datapoints that have no digits in it."

Datapoints that have no numerical values for energy consumption and pollution will be dropped from the dataset. As can be seen from the output above, this is the case for 165 objects.


datav1 <- datav1[grepl("\\d", datav1$pollution) & grepl("\\d", datav1$energy_con), ]

# Now we do the following:

# Create a dataset only with the columns for pollution and energy consumption:
# first create an ID column in the dataset datav1
datav1$ID <- seq(nrow(datav1))

df_energy <- select(datav1, energy_con, pollution, ID)

# create another column only with the letter certification and compute weights
df_energy$weights_energy <- NA
df_energy$weights_pollution <- NA

first_letters_energy <- substr(df_energy$energy_con, 1, 1)
first_letters_pollution <- substr(df_energy$pollution, 1, 1)

for (i in 1:length(first_letters_energy)) {
  letter_energy <- first_letters_energy[i]
  letter_pollution <- first_letters_pollution[i]
  number_energy <- energy_letters[[letter_energy]]
  number_pollution <- energy_letters[[letter_pollution]]
  
  # compute the weights for energy consumption
  weight_energy <- (number_energy/(number_pollution + number_energy))
  
  weights_pollution <- 1 - weight_energy
  
  df_energy$weights_energy[i] <- weight_energy
  df_energy$weights_pollution[i] <- weights_pollution
}


Next, two columns are created in the above dataframe in which only the numerical values of the columns ‘energy_con’ and ‘pollution’ are stored.


df_energy %>% 
  mutate(kw_year = as.numeric(gsub("\\D", "", energy_con)),
         co_year = as.numeric(gsub("\\D", "", pollution)),
         energy_score = weights_energy * kw_year + weights_pollution * co_year) %>% 
    select(ID, energy_score) -> df_energy

Now a distribution of the scores is created. As 7 buckets are to be created, the upper bound of each bucket is taken at the \((100/7)\cdot bucket\_number\) percentile of the distribution, where the first cut-off equals \(\frac{100}{7}=14.285714\,\%\).


# get the 6 inner cut-off points of the distribution of unique scores
quantile_i <- 1/7
probs <- quantile_i * (1:6)
classifier <- quantile(unique(df_energy$energy_score), probs = probs)


Our energy score reads:

Bucket               Interpretation  Numerical Value
(0; 58.877]          very good       0
(58.877; 98.341]     good            1
(98.341; 139.922]    normal          2
(139.922; 179.444]   acceptable      3
(179.444; 301.143]   bad             4
(301.143; 1137.928]  very bad        5
(1137.928; \(\infty\))  horrible        6
names(classifier) <- c(0, 1, 2, 3, 4, 5)
df_energy$energy_class <- NA
# now we iterate over the dataset and assign the classifier to those values
for (i in 1:nrow(df_energy)){
  
  logical_list <- sapply(classifier, function(x) x <= df_energy$energy_score[i])
  no_true <- sum(logical_list)
  
 if (no_true < 6) {
   # get the index name with the first false value, this is the energy classification
   index_false <- names(which(!logical_list)[1])
   df_energy$energy_class[i] <- as.numeric(index_false)
 } else {
   df_energy$energy_class[i] <- 6
 }
  
}

df_energy <- select(df_energy, ID, energy_class)
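The bucket loop above can also be written with findInterval(), which returns for each score the number of cut-off points lying at or below it, i.e. the same 0-6 classification (a sketch, assuming classifier is sorted in increasing order):

```r
# equivalent one-liner to the classification loop
df_energy$energy_class_alt <- findInterval(df_energy$energy_score, classifier)
```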

Now the datasets are joined by their ID columns. Here’s the output of the operations above:


datav1 <- left_join(datav1, df_energy, by = 'ID')

datatable(datav1, options = list(pageLength = 10))
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

The columns pollution and energy_con are not needed anymore, so they will be dropped from the dataset. Next, the columns ‘room’, ‘bathroom’, ‘unemp’, ‘surface’, ‘size_sqm’, and ‘mean_salary’ will be cleaned so that they contain only numerical values. Further, the datatype of the column price needs to be converted from “character” to “numeric”.


# drop columns
datav1 <- select(datav1,-c(pollution, energy_con))

# substitute the commas in the unemp column with points and transform the column to numerical values
datav1$unemp <- as.numeric(gsub(",", ".", datav1$unemp))

# we now clean the surface column.
# first remove all non-numerical values
datav1$surface <- gsub("[^0-9,.]", "", datav1$surface)

# now remove points that are being used as thousand-separators
datav1$surface <- gsub("\\.", "", datav1$surface)

# now substitute the commas by points and transform the values to numeric
datav1$surface <- as.numeric(gsub('\\,',".", datav1$surface))

# we now clean the column mean_salary
# remove the point working as thousand separator and transform data to numeric
datav1$mean_salary <- as.numeric(gsub('\\.', "", datav1$mean_salary))

# transform price column to be a numerical value
datav1$price <- as.numeric(datav1$price)

# transform column size_sqm to be numerical
datav1$size_sqm <- as.numeric(gsub('[^0-9,.]', "", datav1$size_sqm))

# transform column bathroom to be numerical
datav1$bathroom <- as.numeric(gsub('[^0-9]',"", datav1$bathroom))

# transform room column to be numerical 
datav1$room <- as.numeric(gsub('[^0-9]',"", datav1$room))

datatable(datav1, options = list(pageLength = 10))
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
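To see how the cleaning steps interact on a Spanish-formatted number, consider this hypothetical value:

```r
x <- "1.234,56 km²"
x <- gsub("[^0-9,.]", "", x)    # keep digits and separators: "1.234,56"
x <- gsub("\\.", "", x)         # drop the thousand separator: "1234,56"
as.numeric(gsub(",", ".", x))   # comma -> decimal point: 1234.56
```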

Note that the surface information was obtained only so that the population density could be computed, which is done in the next step:


datav1 <- mutate(datav1,
       density_skm = pop/surface)


2.2. Exploratory Analysis: Plotting the Data

2.2.1 Exploring Distributional Features in the Data

In this section the goal is to understand the dataset at hand; to that end, descriptive statistics and plots are used.

Note that the column lift contains only the value Sí or NA, meaning that the question of having a lift was either answered with Sí (yes) or no information was given at all. In this case, given the counts of Sí and NA values, the NAs will be replaced by no.

non_na_data <- table(as.factor(datav1$lift), exclude = NULL)
barplot(non_na_data,
        names.arg = c("Sí", "NA"),
        xlab = 'category',
        ylab = 'count'
        )

So the counts of Sí and NA are:

non_na_data
## 
##   Sí <NA> 
## 7916 4934

Now, replace the values:

datav1[is.na(datav1$lift), 'lift'] = 'no'
# and now let's change 'Sí' to 'yes'
datav1[datav1$lift == 'Sí', 'lift'] = 'yes'

Next, substitute all the NA values in the dataset by the value \(-9\).

for (column in colnames(datav1)) {
  if (sum(is.na(datav1[[column]])) > 0) {
    if (is.character(datav1[[column]])) {
      datav1[is.na(datav1[column]), column] = '-9'
    } else {
      datav1[is.na(datav1[column]), column] = -9
    }
  }
}
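A vectorized sketch of the same replacement, using dplyr and tidyr (tidyr is an assumption here; dplyr is already used in this file), shown on a toy frame rather than datav1:

```r
library(dplyr)
library(tidyr)

# toy frame standing in for datav1 (hypothetical values)
df <- tibble(price = c(100, NA), lift = c(NA, 'yes'))

# '-9' for character columns, -9 for numeric columns
df <- df %>%
  mutate(across(where(is.character), ~ replace_na(.x, '-9')),
         across(where(is.numeric),   ~ replace_na(.x, -9)))
df
```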

datatable(datav1, options = list(pageLength = 10))
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

A closer look at the column floor reveals some redundancy. Next, this problem will be tackled in order to reduce the number of factor levels in that column.

unique(datav1$floor)
##  [1] "3ª Planta"            "1ª Planta"            "-9"                  
##  [4] "2ª Planta"            "4ª Planta"            "7ª Planta"           
##  [7] "5ª Planta"            "Bajos"                "6ª Planta"           
## [10] "9ª Planta"            "8ª Planta"            "14ª Planta"          
## [13] "Principal"            "Entresuelo"           "11ª Planta"          
## [16] "Superior a Planta 15" "10ª Planta"           "Sótano"              
## [19] "12ª Planta"           "15ª Planta"           "Subsótano"           
## [22] "13ª Planta"

Notice that Principal and Bajos mean the same thing: the ground floor. One is normally used for houses and the other for buildings; hence, Bajos will be substituted by Principal.

datav1[grepl('Bajos', datav1$floor), 'floor'] = 'Principal'

Note that:

Subsótano (sub-basement) < Sótano (basement) < Entresuelo (between basement and ground floor) < Principal < 1ª Planta < 2ª Planta … < 15ª Planta < Superior a Planta 15 (above the 15th floor). It follows that this variable carries an ordinal ordering, and it will thus be encoded as such.

vector1 <- paste0(1:15, 'ª Planta')

vector0 <- c('-9', 'Subsótano', 'Sótano', 'Entresuelo', 'Principal')

vector2 <- c('Superior a Planta 15')

# join the three vectors
ordered_floors <- append(vector0, vector1)
ordered_floors <- append(ordered_floors, vector2)
ordered_floors
##  [1] "-9"                   "Subsótano"            "Sótano"              
##  [4] "Entresuelo"           "Principal"            "1ª Planta"           
##  [7] "2ª Planta"            "3ª Planta"            "4ª Planta"           
## [10] "5ª Planta"            "6ª Planta"            "7ª Planta"           
## [13] "8ª Planta"            "9ª Planta"            "10ª Planta"          
## [16] "11ª Planta"           "12ª Planta"           "13ª Planta"          
## [19] "14ª Planta"           "15ª Planta"           "Superior a Planta 15"

Now the column floor is transformed into an ordered factor:

datav1$floor <- factor(datav1$floor, ordered = TRUE, levels = ordered_floors)
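Ordered factors support direct comparisons, which is what this ordinal encoding buys us. A small self-contained illustration (with a reduced set of levels, not the full ordered_floors vector):

```r
# a subset of the floor levels, in ascending order
demo_levels <- c('Sótano', 'Principal', '1ª Planta', '2ª Planta')
f <- factor(c('2ª Planta', 'Sótano'), ordered = TRUE, levels = demo_levels)

f[2] < f[1]   # Sótano is below 2ª Planta: TRUE
max(f)        # highest floor present in the vector
```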

Notice that the age column follows a similar pattern:

unique(datav1$age)
##  [1] "-9"             "50 a 70 años"   "20 a 30 años"   "30 a 50 años"  
##  [5] "Menos de 1 año" "10 a 20 años"   "70 a 100 años"  "5 a 10 años"   
##  [9] "1 a 5 años"     "+ 100 años"

Ordering the values yields:

ordered_ages <- c('-9', "Menos de 1 año", '1 a 5 años', "5 a 10 años", "10 a 20 años",
                   "20 a 30 años", "30 a 50 años", "50 a 70 años",
                   "70 a 100 años", "+ 100 años")

datav1$age <- factor(datav1$age, ordered = TRUE, levels = ordered_ages)

datav1$parking <- factor(datav1$parking)

datav1$region <- factor(datav1$region)

datav1$type <- factor(datav1$type)

This dataframe is now saved as the dataset used in the analysis from here on; it is called dataset.

dataset <- datav1

#write.csv(dataset, file = './01_RawDatasets/00_CleanedSet.csv')
# note: re-importing from csv resets the factor encodings created above
dataset <- read.csv('./01_RawDatasets/00_CleanedSet.csv')

Next, the distribution of prices is plotted in order to understand the data.

#png('./00_Pictures/02_Processed/06_pricedist.png')
options(scipen = 2, digits = 6)
dens <- density(dataset$price)
plot(dens, col = 'darkblue', ylab = 'Density', xlab = 'Prices in €', main = 'Price-Distribution')
polygon(dens, col = 'lightblue', border = NA)
grid()

#dev.off()

Notice that the distribution of prices resembles a \(\chi^{2}\) distribution much more than a normal one; this is because housing prices are always positive and outliers are not really that rare. Some descriptive statistics allow us to obtain a clearer picture of this distribution.

# The distribution
summary(dataset$price)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    10000   149900   239000   360380   399000 12500000

The cheapest objects in the dataset are a house and an apartment, located in Villanueva de Córdoba and Barcelona, respectively:

# look at the min values:

dataset[dataset$price == 10000, ]
##           X price          type floor parking lift age condition room bathroom
## 6543   6543 10000 Casa o chalet    -9      -9   no  -9        -9    1       -9
## 12602 12602 10000          Piso    -9 Privado   no  -9        -9    1        1
##       size_sqm     pop          municipality  province    region mean_salary
## 6543       104    8587 Villanueva de Córdoba   Córdoba Andalucía        9167
## 12602       10 1636193             Barcelona Barcelona  Cataluña       16750
##       surface unemp crime_province    ID energy_class density_skm
## 6543   429.52 17.77        3671.47  6543            5     19.9921
## 12602   99.11 10.15        6412.01 12602            5  16508.9000

While the most expensive object is a house located in Barcelona:

dataset[dataset$price == 12500000,]
##         X    price          type floor parking lift          age  condition
## 2089 2089 12500000 Casa o chalet    -9 Privado   no 30 a 50 años Casi nuevo
##      room bathroom size_sqm     pop municipality  province   region mean_salary
## 2089   10       11     2781 1636193    Barcelona Barcelona Cataluña       16750
##      surface unemp crime_province   ID energy_class density_skm
## 2089   99.11 10.15        6412.01 2089            1     16508.9

Next, the standard deviation and skewness of this distribution are computed:

std <- sqrt(var(dataset$price))
skewness <- (sum((dataset$price - mean(dataset$price))^3) / nrow(dataset)) / (std^3)

paste('skewness: ', skewness, '. Standard Dev.:', std)
## [1] "skewness:  7.07977677236022 . Standard Dev.: 433827.506257726"
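Note that this statistic divides the third central moment by \(n\) but uses the \((n-1)\)-denominator standard deviation from var(); for a sample of this size the difference is negligible. A small self-contained check of the formula's behavior on toy vectors:

```r
skew <- function(x) {
  std <- sqrt(var(x))  # sample sd (n - 1 denominator), as above
  (sum((x - mean(x))^3) / length(x)) / std^3
}

skew(c(1, 2, 3, 4, 5))  # symmetric sample: skewness 0
skew(c(1, 1, 1, 10))    # right-skewed sample: positive
```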

A plot of the factor variables can also help to identify properties of the data.

factor_columns <- c('type', 'floor', 'lift', 'age', 'condition')
par(mfrow = c(2, 1), cex.axis = 0.5, cex.lab = 0.5, cex.main = 0.8)

for (i in factor_columns) {
  counts <- table(dataset[i])
  barplot(counts, xlab = 'Factors', ylab = 'Frequency', main = paste('Variable: ', i, '-Distribution', sep = ""))
  grid()
  barplot(counts, names.arg = names(counts), add = TRUE)
  
}

One sees from the plot above that missing values play an important role for several variables.

2.2.2. Mean and Median Prices per Autonomous Community and Province

In this section the object of study is the distribution of prices per autonomous community and province, in order to better understand the spatial differences inside the country. The shapefiles used in this section come from the ArcGIS website.

(sf_aa_cc <- st_read('./01_SF_Autonomous_Regions/Comunidades_Autonomas_ETRS89_30N.shp',
                    layer = 'Comunidades_Autonomas_ETRS89_30N'))
## Reading layer `Comunidades_Autonomas_ETRS89_30N' from data source 
##   `/Volumes/ExternalNew/Documents/01_Studies/01_University/01_Master/S.S.2023/03_Topics_in_Empirical_Research/03_Project/00_Project_LFE_8105069/01_SF_Autonomous_Regions/Comunidades_Autonomas_ETRS89_30N.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 19 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -1004500 ymin: 3132140 xmax: 1126930 ymax: 4859240
## Projected CRS: ETRS89 / UTM zone 30N
## Simple feature collection with 19 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -1004500 ymin: 3132140 xmax: 1126930 ymax: 4859240
## Projected CRS: ETRS89 / UTM zone 30N
## First 10 features:
##    Codigo                  Texto            Texto_Alt
## 1      01              Andalucía            Andalucía
## 2      02                 Aragón               Aragón
## 3      03 Principado de Asturias             Asturias
## 4      04         Islas Baleares        Illes Balears
## 5      05               Canarias             Canarias
## 6      06              Cantabria            Cantabria
## 7      07        Castilla y León      Castilla y León
## 8      08   Castilla - La Mancha Castilla - La Mancha
## 9      09               Cataluña            Catalunya
## 10     10   Comunidad Valenciana Comunitat Valenciana
##                          geometry
## 1  MULTIPOLYGON (((280487 3993...
## 2  MULTIPOLYGON (((683851 4754...
## 3  MULTIPOLYGON (((271019 4838...
## 4  MULTIPOLYGON (((885505 4299...
## 5  MULTIPOLYGON (((-978944 317...
## 6  MULTIPOLYGON (((478590 4790...
## 7  MULTIPOLYGON (((505342 4715...
## 8  MULTIPOLYGON (((468371 4498...
## 9  MULTIPOLYGON (((825345 4515...
## 10 MULTIPOLYGON (((720291 4227...

Next, the dataset is grouped by region using the function group_by(), and the result is summarized using means.


unique(dataset$region)
##  [1] "Andalucía"              "Cataluña"               "Comunidad de Madrid"   
##  [4] "País Vasco"             "Galicia"                "Comunidad Valenciana"  
##  [7] "Aragón"                 "Castilla y León"        "Canarias"              
## [10] "Castilla-La Mancha"     "Cana"                   "Cantabria"             
## [13] "Extremadura"            "Melilla"                "Navarra"               
## [16] "Principado de Asturias"

There is a non-existent region named Cana, which is assumed to be an error; its name should be ‘Canarias’. This can be verified by checking the cities in that region:

head(select(dataset[dataset$region == 'Cana', ], municipality), 5)
##                    municipality
## 476  Las Palmas de Gran Canaria
## 501      Santa Cruz de Tenerife
## 674  Las Palmas de Gran Canaria
## 945      Santa Cruz de Tenerife
## 1001 Las Palmas de Gran Canaria

The cities in the output above are clearly in the Canary Islands, so the region name will be corrected.


dataset[dataset$region == 'Cana', 'region'] = 'Canarias'

dataset[c('price', 'region')] %>% 
  group_by(region) %>% 
    summarize(lower_quarter = quantile(price, probs = c(0.25)),
              median = quantile(price, probs = c(0.5)),
              upper_quarter = quantile(price, probs = c(0.75)),
              mean_price = mean(price),
              n = n()) -> quant_prices_region

groupped_by_region <- group_by(dataset, region)

In order to display this information graphically, one has to merge the dataset imported from the shapefile with quant_prices_region. Here, again, the min-string-distance method is used. A closer look into the sf_aa_cc dataframe shows a column named Codigo (code), which uniquely identifies the regions and will be used to join the dataframes.


# First, create a dataframe with the column Codigo and the region names as given in sf_aa_cc
df_sf <- as.data.frame(sf_aa_cc)

# Create a column in the quant_prices_region to store the (matched IDs)
quant_prices_region$Codigo <- NA

# Let's first take a look at the region names in the two datasets.
unique(sort(quant_prices_region$region))
##  [1] "Andalucía"              "Aragón"                 "Canarias"              
##  [4] "Cantabria"              "Castilla y León"        "Castilla-La Mancha"    
##  [7] "Cataluña"               "Comunidad de Madrid"    "Comunidad Valenciana"  
## [10] "Extremadura"            "Galicia"                "Melilla"               
## [13] "Navarra"                "País Vasco"             "Principado de Asturias"


unique(sort(df_sf$Texto))
##  [1] "Andalucía"                  "Aragón"                    
##  [3] "Canarias"                   "Cantabria"                 
##  [5] "Castilla - La Mancha"       "Castilla y León"           
##  [7] "Cataluña"                   "Ceuta"                     
##  [9] "Comunidad de Madrid"        "Comunidad Foral de Navarra"
## [11] "Comunidad Valenciana"       "Extremadura"               
## [13] "Galicia"                    "Islas Baleares"            
## [15] "La Rioja"                   "Melilla"                   
## [17] "País Vasco"                 "Principado de Asturias"    
## [19] "Región de Murcia"

The dataset quant_prices_region has fewer regions than the one with all the regions of Spain, but at least the names are practically the same (except Comunidad Foral de Navarra) for the existing regions. Thus, the min-string-distance approach is likely to work very well.
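The renaming of Navarra below matters: under the 'lcs' metric the distance grows with every unmatched character, so the short name 'Navarra' is actually closer to 'Cantabria' than to the long official name 'Comunidad Foral de Navarra', and would be mismatched without the rename. A minimal illustration (assuming the stringdist package, which is used below):

```r
library(stringdist)

# lcs distance = nchar(a) + nchar(b) - 2 * length(longest common subsequence)
d <- stringdist('Navarra',
                c('Comunidad Foral de Navarra', 'Cantabria'),
                method = 'lcs')
d  # the second (wrong) candidate is closer
```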

quant_prices_region[quant_prices_region$region == 'Navarra', 'region'] = 'Comunidad Foral de Navarra'
# Iterate over the dataset quant_prices_region, compute the string min-distance and assign the 'codigo' to it
for (i in 1:nrow(quant_prices_region)) {
  distances <- stringdist(quant_prices_region$region[i], df_sf$Texto, method = 'lcs') # method: longest common subsequence
  min_distance <- which.min(distances)
  
  quant_prices_region$Codigo[i] = df_sf$Codigo[min_distance]
}

# join the frames and inspect the result for correctness
joined_frames <- left_join(quant_prices_region, df_sf, by = 'Codigo')

#view(joined_frames)

The regions have been correctly assigned. Next, they need to be brought into the shapefile by joining the dataframes with left_join(), where the left dataframe is df_sf.

mean_prices_to_be_joined <- select(quant_prices_region, Codigo, mean_price, median, n)
df_sf <- left_join(df_sf, quant_prices_region, by = 'Codigo')

Next, the datasets are merged and the region centroids are computed:

merged_data <- merge(sf_aa_cc, mean_prices_to_be_joined, by.x = 'Codigo', by.y = 'Codigo', all.x = TRUE)
merged_data$center <- st_centroid(merged_data$geometry)

merged_data <- mutate(merged_data,
       cod_text = paste(Codigo, Texto, sep = " "))

Some region names will now be shortened, so that they can be used as labels in the plots:

Old Name                     New Name
Principado de Asturias       Asturias
Castilla y León              Cas. y León
Comunidad de Madrid          Madrid
Comunidad Foral de Navarra   Navarra
Islas Baleares               I. Baleares
Castilla - La Mancha         La Mancha
Comunidad Valenciana         Valencia
Región de Murcia             Murcia


old_names <- c('Principado de Asturias', 'Castilla y León', 'Comunidad de Madrid',
               'Comunidad Foral de Navarra', 'Islas Baleares', 'Castilla - La Mancha',
               'Comunidad Valenciana', 'Región de Murcia')
new_names <- c('Asturias', 'Cas. y León', 'Madrid', 'Navarra', 'I. Baleares', 'La Mancha',
               'Valencia', 'Murcia')

for (i in 1:length(old_names)) {
  merged_data$cod_text <- gsub(old_names[i], new_names[i], merged_data$cod_text)
}

The graphic below shows the mean real estate prices by autonomous community in Spain:

map_plot <- ggplot() +
  geom_sf(data = merged_data, aes(fill = mean_price)) +
  scale_fill_gradient2(low = "lightblue", 
                       mid = 'white', 
                       high = "red", 
                       midpoint = median(merged_data$mean_price, na.rm = TRUE), 
                       label = comma_format()) +
  geom_text(data = merged_data, 
            aes(x = st_coordinates(center)[, 1], y = st_coordinates(center)[, 2], label = Codigo),
            color = "black",
            size = 3,
            nudge_x = 0.1,
            nudge_y = 0.1) +
  labs(fill = "Mean Price",  title = 'Mean Housing Prices Per Autonomous Region') +
  theme_minimal() +
  theme(plot.margin = margin(0, 0, 0, 0, "cm"), 
        axis.title.x = element_blank(), 
        axis.title.y = element_blank())

# Create a separate label plot with region names
label_plot <- ggplot() +
  geom_text(data = merged_data, aes(x = 1, y = Codigo, label = cod_text), 
            color = "black", 
            size = 2, 
            hjust = 0, 
            vjust = 0.5) +
  labs(x = "", y = "") +
  theme_minimal() +
  theme(axis.text = element_blank(), 
        axis.ticks = element_blank(), 
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.title.x = element_blank(), 
        axis.title.y = element_blank()
        )

# Combine the map plot and label plot using the patchwork package
combined_plot <- map_plot + label_plot +
  plot_layout(ncol = 2, widths = c(11.5, 2.9)) +
  theme(legend.text = element_text(size = 105)) 
  
# Display the combined plot
combined_plot


One should notice that the dataset has a severe problem with outliers:


plot_ar <- ggplot(groupped_by_region, aes(x = region, y = price)) +
  geom_boxplot() +
  labs(title = "Boxplot: Prices per Autonomous Region", x = "Region", y = "Prices") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 35, hjust = 1)) +
  scale_y_continuous(labels = comma_format())

plot_ar

Given these outliers, it makes sense to plot the prices per region based on the median instead of the mean.


map_plot <- ggplot() +
  geom_sf(data = merged_data, aes(fill = median)) +
  scale_fill_gradient2(low = "lightblue", 
                       mid = 'white', 
                       high = "red", 
                       midpoint = median(merged_data$median, na.rm = TRUE), 
                       label = comma_format()) +
  geom_text(data = merged_data, 
            aes(x = st_coordinates(center)[, 1], y = st_coordinates(center)[, 2], label = Codigo),
            color = "black",
            size = 3,
            nudge_x = 0.1,
            nudge_y = 0.1) +
  labs(fill = "Median Price",  title = 'Median Housing Prices Per Autonomous Region') +
  theme_minimal() +
  theme(plot.margin = margin(0, 0, 0, 0, "cm"), 
        axis.title.x = element_blank(), 
        axis.title.y = element_blank())

# Create a separate label plot with region names
label_plot <- ggplot() +
  geom_text(data = merged_data, aes(x = 1, y = Codigo, label = cod_text), 
            color = "black", 
            size = 2, 
            hjust = 0, 
            vjust = 0.5) +
  labs(x = "", y = "") +
  theme_minimal() +
  theme(axis.text = element_blank(), 
        axis.ticks = element_blank(), 
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.title.x = element_blank(), 
        axis.title.y = element_blank()
        )

# Combine the map plot and label plot using the patchwork package
combined_plot <- map_plot + label_plot +
  plot_layout(ncol = 2, widths = c(11.5, 2.9)) +
  theme(legend.text = element_text(size = 105)) 

#ggsave('./00_Pictures/02_Processed/10_MedianPrice_Region.png', plot = map_plot)
# Display the combined plot
combined_plot


Ranking the autonomous regions according to their medians and means yields:

merged_data <- merged_data %>% mutate(Rank_Mean = rank(-mean_price),
                                      Rank_Median = rank(-median),
                                      )

display_merged <- select(merged_data, !c('geometry'))

datatable(display_merged, options = list(pageLength = 10))


Extra: produce a table to be exported to LaTeX:

latex_table <- as.data.frame(display_merged)
latex_table <- select(latex_table, Codigo, Texto, median, Rank_Median, n)

column_names <- c('Code', 'Regions', 'Median Price in €', 'Rank-Median', 'No. Obs.')

colnames(latex_table) <- column_names

latex_table <- arrange(latex_table, `Rank-Median`)


latex_table[is.na(latex_table$`Median Price in €`), 'Rank-Median'] = 'NA'

write_latex_table <- kable(latex_table, format = "latex", booktabs = TRUE)

writeLines(write_latex_table, 'table.tex')

Nevertheless, one must be careful about generalizations regarding the price distribution, since some regions are much better represented than others. The following graph shows for which regions this is the case:

price_dist_plot <- ggplot(groupped_by_region, aes(x = region, y = price)) +
  geom_boxplot() +
  labs(title = "Boxplot: Prices per Autonomous Region", x = "Region", y = "Prices (scale = log10)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 35, hjust = 1)) +
  scale_y_continuous(labels = comma_format()) + scale_y_log10()
## Scale for y is already present.
## Adding another scale for y, which will replace the existing scale.
# ----------------------------------

number_units <- summarize(groupped_by_region, numbers = n())
n_per_region <-ggplot(number_units, aes(x = region, y = numbers, fill = region)) + 
  geom_bar(stat = 'identity') + 
  labs(x = 'Region', y = 'No. of Datapoints', title = 'Number of Obs. Per Autonomous Region') +
  theme_minimal() +
  scale_y_continuous(limits = c(0, NA)) +
  theme(legend.text = element_text(size =5),
        legend.key.size = unit(0.2, 'cm'),
        axis.text.x = element_blank(),
        legend.position = 'none') +
  guides(fill = guide_legend(override.aes = list(size = 0.01, ncol = 3)))  

combined_plots <- grid.arrange(price_dist_plot, n_per_region, nrow = 2, heights = c(1.5, 1))

#ggsave(plot = combined_plots, "./00_Pictures/02_Processed/11_Boxplot_Region.png",
       #height = 5, width = 10)
combined_plots
## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]

The same analysis as above will be done now at a province level:


groupped_by_province <- group_by(dataset, province)
groupped_by_province %>% 
  summarize(
    lower_quarter = quantile(price, probs = 0.25),
    median = quantile(price, probs = 0.5),
    upper_quarter = quantile(price, probs = 0.75),
    mean_price = mean(price)
  ) %>% 
  select(lower_quarter, median, upper_quarter, mean_price, province) -> quant_prices_province 

quant_prices_province
## # A tibble: 40 × 5
##    lower_quarter median upper_quarter mean_price province   
##            <dbl>  <dbl>         <dbl>      <dbl> <chr>      
##  1        162500 230000        297500    230000  Albacete   
##  2        110000 165000        297500    240856. Alicante   
##  3         63915 111000        165000    141029. Almería    
##  4         74000  85000         98500     96488. Badajoz    
##  5        185000 279250        463750    421767. Barcelona  
##  6         86925 148500        214500    150483. Burgos     
##  7        157250 159500        279500    226250  Cantabria  
##  8        115000 148000        219900    167140. Castellón  
##  9         34775  61000         90000    114278. Ciudad Real
## 10        140000 140000        140000    140000  Cuenca     
## # ℹ 30 more rows

Next we import the shapefile containing the map of Spain divided in provinces.


sf_provinces <- st_read('./02_SF_Provinces/Provincias_ETRS89_30N.shp')
## Reading layer `Provincias_ETRS89_30N' from data source 
##   `/Volumes/ExternalNew/Documents/01_Studies/01_University/01_Master/S.S.2023/03_Topics_in_Empirical_Research/03_Project/00_Project_LFE_8105069/02_SF_Provinces/Provincias_ETRS89_30N.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 52 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -1004500 ymin: 3132140 xmax: 1126930 ymax: 4859240
## Projected CRS: ETRS89 / UTM zone 30N

Next, a column named Codigo is created in the quant_prices_province dataset and used to store the matched codes obtained with the stringdist() function. First, the unique provinces in both datasets are inspected visually. Names that are very different but still refer to the same province will be changed a priori, in order to increase the accuracy of the min-string-distance method.


unique(sort(quant_prices_province$province))
##  [1] "Álava"                  "Albacete"               "Alicante"              
##  [4] "Almería"                "Badajoz"                "Barcelona"             
##  [7] "Burgos"                 "Cáceres"                "Cádiz"                 
## [10] "Cantabria"              "Castellón"              "Ciudad Real"           
## [13] "Córdoba"                "Cuenca"                 "Granada"               
## [16] "Guadalajara"            "Huelva"                 "Huesca"                
## [19] "Jaén"                   "Las Palmas"             "León"                  
## [22] "Lugo"                   "Madrid"                 "Málaga"                
## [25] "Melilla"                "Navarra"                "Palencia"              
## [28] "Pontevedra"             "Principado de Asturias" "Salamanca"             
## [31] "Santa Cruz de Tenerife" "Segovia"                "Sevilla"               
## [34] "Tarragona"              "Teruel"                 "Toledo"                
## [37] "Valencia"               "Valladolid"             "Zamora"                
## [40] "Zaragoza"

Next the names in the shapefile:


unique(sort(sf_provinces$Texto))
##  [1] "Álava"                  "Albacete"               "Alicante"              
##  [4] "Almería"                "Asturias"               "Ávila"                 
##  [7] "Badajoz"                "Barcelona"              "Burgos"                
## [10] "Cáceres"                "Cádiz"                  "Cantabria"             
## [13] "Castellón"              "Ceuta"                  "Ciudad Real"           
## [16] "Córdoba"                "Cuenca"                 "Gerona"                
## [19] "Granada"                "Guadalajara"            "Guipúzcoa"             
## [22] "Huelva"                 "Huesca"                 "Islas Baleares"        
## [25] "Jaén"                   "La Coruña"              "La Rioja"              
## [28] "Las Palmas"             "León"                   "Lleida"                
## [31] "Lugo"                   "Madrid"                 "Málaga"                
## [34] "Melilla"                "Murcia"                 "Navarra"               
## [37] "Orense"                 "Palencia"               "Pontevedra"            
## [40] "Salamanca"              "Santa Cruz de Tenerife" "Segovia"               
## [43] "Sevilla"                "Soria"                  "Tarragona"             
## [46] "Teruel"                 "Toledo"                 "Valencia"              
## [49] "Valladolid"             "Vizcaya"                "Zamora"                
## [52] "Zaragoza"

The name Principado de Asturias will be changed to Asturias in order to increase the accuracy of the min-string-distance method.

quant_prices_province[quant_prices_province$province == 'Principado de Asturias', 'province'] = 'Asturias'
# Create a column to store the ids
quant_prices_province$Codigo <- NA
# Now compute the string distance by iterating over the dataset
for (i in 1:nrow(quant_prices_province)) {
  quant_prices_province$Codigo[i] = sf_provinces$Codigo[which.min(stringdist(sf_provinces$Texto, quant_prices_province$province[i]))]
  
}



# Let us join the frames and check for correctness
joined_frames <- left_join(quant_prices_province, sf_provinces, by = 'Codigo')
#view(joined_frames)

Once the correctness of the assignments has been visually inspected, one can merge both datasets and use the result to produce the map.


merged_data <- merge(sf_provinces, quant_prices_province, by.x = 'Codigo', by.y = 'Codigo', all.x = TRUE)
# We shorten the name of santa cruz de tenerife to S.C. Tenerife
merged_data[merged_data$Texto == 'Santa Cruz de Tenerife', 'Texto'] = 'S.C. Tenerife'

# We create a column with the names of the provinces plus their code, in order for us to use them as labels
merged_data <- mutate(merged_data, cod_text = paste(Codigo, Texto, sep = " "))
as_tibble(merged_data)
## # A tibble: 52 × 12
##    Codigo Texto      Texto_Alt Cod_CCAA CCAA  lower_quarter median upper_quarter
##    <chr>  <chr>      <chr>     <chr>    <chr>         <dbl>  <dbl>         <dbl>
##  1 01     Álava      Araba     16       País…        192750 254500        335000
##  2 02     Albacete   Albacete  08       Cast…        162500 230000        297500
##  3 03     Alicante   Alacant   10       Comu…        110000 165000        297500
##  4 04     Almería    Almería   01       Anda…         63915 111000        165000
##  5 05     Ávila      Ávila     07       Cast…            NA     NA            NA
##  6 06     Badajoz    Badajoz   11       Extr…         74000  85000         98500
##  7 07     Islas Bal… Illes Ba… 04       Ille…            NA     NA            NA
##  8 08     Barcelona  Barcelona 09       Cata…        185000 279250        463750
##  9 09     Burgos     Burgos    07       Cast…         86925 148500        214500
## 10 10     Cáceres    Cáceres   11       Extr…        457000 457000        457000
## # ℹ 42 more rows
## # ℹ 4 more variables: mean_price <dbl>, province <chr>,
## #   geometry <MULTIPOLYGON [m]>, cod_text <chr>

Next the map will be created.


# Get max and min coordinates
max_x <- c()
max_y <- c()
min_x <- c()
min_y <- c()
# This will be used to set the limits of the map in order to focus on the mainland.
# Note: this reads only the first ring of each feature's first polygon;
# sf::st_bbox(merged_data) would yield the overall bounding box directly.
for (i in 1:length(merged_data$geometry)) {
  matrix <- merged_data$geometry[[i]][[1]][[1]]
  max_x <- append(max_x, max(matrix[, c(1)]))
  max_y <- append(max_y, max(matrix[, c(2)]))
  min_x <- append(min_x, min(matrix[, c(1)]))
  min_y <- append(min_y, min(matrix[, c(2)]))
}
merged_data$center <- sf::st_centroid(merged_data$geometry)
#mainland_limits <- c(min(min_x) /5  , max(max_x), min(min_y)*1.15, max(max_y))
mainland_limits <- c(min(min_x)  , max(max_x), min(min_y), max(max_y))
map_plot <- ggplot() +
  geom_sf(data = merged_data, aes(fill = mean_price)) +
  scale_fill_gradient2(low = "lightblue", 
                       mid = 'white', 
                       high = "red", 
                       midpoint = median(merged_data$mean_price, na.rm = TRUE), 
                       label = comma_format()) +
  geom_text(data = merged_data, 
            aes(x = st_coordinates(center)[, 1], y = st_coordinates(center)[, 2], label = Codigo),
            color = "black",
            size = 2,
            nudge_x = 0.1,
            nudge_y = 0.1) +
  labs(fill = "Mean Price €",  title = 'Mean Housing Price Per Province (Spain)') +
  theme_minimal() +
  theme(plot.margin = margin(0, 0, 0, 0, "cm"), 
        axis.title.x = element_blank(), 
        axis.title.y = element_blank()) +
  coord_sf(xlim = mainland_limits[1:2], ylim = mainland_limits[3:4])

# Create a separate label plot with region names
label_plot <- ggplot() +
  geom_text(data = merged_data, aes(x = 1, y = Codigo, label = cod_text), 
            color = "black", 
            size = 2, 
            hjust = 0, 
            vjust = 0.5) +
  labs(x = "", y = "") +
  theme_minimal() +
  theme(axis.text = element_blank(), 
        axis.ticks = element_blank(), 
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.title.x = element_blank(), 
        axis.title.y = element_blank()
        )

# Combine the map plot and label plot using the patchwork package
combined_plot <- map_plot +  label_plot +
  plot_layout(ncol = 2, widths = c(11.5, 2.9)) +
  theme(legend.text = element_text(size = 105)) 
  
# Display the combined plot
combined_plot

Let us now plot a boxplot to check for the outliers per province.

price_dist_plot <- ggplot(groupped_by_province, aes(x = province, y = price)) +
  geom_boxplot() +
  labs(title = "Boxplot: Prices per Province", x = "Province", y = "Prices (scale = log10)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 35, hjust = 1)) +
  scale_y_log10(labels = comma_format())
price_dist_plot

Next, the median housing prices per province will be plotted.


merged_data$center <- sf::st_centroid(merged_data$geometry)
#mainland_limits <- c(min(min_x) /5  , max(max_x), min(min_y)*1.15, max(max_y))
mainland_limits <- c(min(min_x)  , max(max_x), min(min_y), max(max_y))
map_plot <- ggplot() +
  geom_sf(data = merged_data, aes(fill = median)) +
  scale_fill_gradient2(low = "lightblue", 
                       mid = 'white', 
                       high = "red", 
                       midpoint = median(merged_data$median, na.rm = TRUE), 
                       label = comma_format()) +
  geom_text(data = merged_data, 
            aes(x = st_coordinates(center)[, 1], y = st_coordinates(center)[, 2], label = Codigo),
            color = "black",
            size = 2,
            nudge_x = 0.1,
            nudge_y = 0.1) +
  labs(fill = "Median Price €",  title = 'Median Housing Prices Per Province (Spain)') +
  theme_minimal() +
  theme(plot.margin = margin(0, 0, 0, 0, "cm"), 
        axis.title.x = element_blank(), 
        axis.title.y = element_blank()) +
  coord_sf(xlim = mainland_limits[1:2], ylim = mainland_limits[3:4])

# Create a separate label plot with region names
label_plot <- ggplot() +
  geom_text(data = merged_data, aes(x = 1, y = Codigo, label = cod_text), 
            color = "black", 
            size = 2, 
            hjust = 0, 
            vjust = 0.5) +
  labs(x = "", y = "") +
  theme_minimal() +
  theme(axis.text = element_blank(), 
        axis.ticks = element_blank(), 
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.title.x = element_blank(), 
        axis.title.y = element_blank()
        )

# Combine the map plot and label plot using the patchwork package
combined_plot <- map_plot +   label_plot +
  plot_layout(ncol = 2, widths = c(11.5, 2.9)) +
  theme(legend.text = element_text(size = 105)) 
  
# Display the combined plot
#ggsave("./00_Pictures/02_Processed/06_MedianHousingPrices.png", plot = combined_plot)
combined_plot

What is really surprising is that Navarra and Cáceres have very high medians, even higher than Madrid. Maybe this is due to small sample sizes?

number_units <- summarize(groupped_by_province, numbers = n())
n_per_province <-ggplot(number_units, aes(x = province, y = numbers, fill = province)) + 
  geom_bar(stat = 'identity') + 
  labs(x = 'Province', y = 'No. of Datapoints', title = 'Distribution of Data Per Province') +
  theme_minimal() +
  scale_y_continuous(limits = c(0, NA)) +
  theme(legend.text = element_text(size =5),
        legend.key.size = unit(0.2, 'cm'),
        axis.text.x = element_blank(),
        legend.position = 'none') +
  guides(fill = guide_legend(override.aes = list(size = 0.01, ncol = 3)))  

combined_plots <- grid.arrange(price_dist_plot, n_per_province, nrow = 2, heights = c(1.5, 1))

#ggsave(plot = combined_plots, "./00_Pictures/02_Processed/08_data_dis_prov.png")
combined_plots
## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]

Maybe the situation is better if we group the observations by region?


number_units <- summarize(groupped_by_region, numbers = n())
max_no <- max(number_units$numbers)
ggplot(number_units, aes(x = region, y = numbers, fill = region)) + 
  geom_bar(stat = 'identity') + 
  labs(x = 'Region', y = 'No. of Datapoints', title = 'Distribution of Data Per Autonomous Community') +
  theme_minimal() +
  scale_y_continuous(limits = c(0, NA)) +
  theme(legend.text = element_text(size =5),
        legend.key.size = unit(0.3, 'cm'),
        axis.text.x = element_blank()) +
  guides(fill = guide_legend(override.aes = list(size = 0.2))) 

number_units
## # A tibble: 15 × 2
##    region                 numbers
##    <chr>                    <int>
##  1 Andalucía                 3013
##  2 Aragón                     613
##  3 Canarias                   116
##  4 Cantabria                    8
##  5 Castilla y León            165
##  6 Castilla-La Mancha          32
##  7 Cataluña                  3276
##  8 Comunidad de Madrid       2185
##  9 Comunidad Valenciana      2524
## 10 Extremadura                  9
## 11 Galicia                    606
## 12 Melilla                      1
## 13 Navarra                      3
## 14 País Vasco                 298
## 15 Principado de Asturias       1

Probably these data are not only highly concentrated by province and region but also by city.

groupped_by_municipality <- group_by(dataset, municipality)

mutate(groupped_by_municipality, 
       numbers = n(),
       weights = numbers / nrow(dataset)) %>% 
  select(municipality, numbers, weights) -> number_units

unique(arrange(number_units, desc(numbers)))
## # A tibble: 172 × 3
## # Groups:   municipality [172]
##    municipality            numbers weights
##    <chr>                     <int>   <dbl>
##  1 Madrid                     2185  0.170 
##  2 Barcelona                  2140  0.167 
##  3 Córdoba                    1191  0.0927
##  4 Elche                      1179  0.0918
##  5 Valencia                   1117  0.0869
##  6 Hospitalet de Llobregat    1087  0.0846
##  7 Málaga                      851  0.0662
##  8 Sevilla                     832  0.0647
##  9 Zaragoza                    602  0.0468
## 10 Vigo                        600  0.0467
## # ℹ 162 more rows

Computing weights for regions yields:

arrange(unique(select(mutate(groupped_by_region, weights = n() / nrow(dataset)), region, weights)), desc(weights))
## # A tibble: 15 × 2
## # Groups:   region [15]
##    region                   weights
##    <chr>                      <dbl>
##  1 Cataluña               0.255    
##  2 Andalucía              0.234    
##  3 Comunidad Valenciana   0.196    
##  4 Comunidad de Madrid    0.170    
##  5 Aragón                 0.0477   
##  6 Galicia                0.0472   
##  7 País Vasco             0.0232   
##  8 Castilla y León        0.0128   
##  9 Canarias               0.00903  
## 10 Castilla-La Mancha     0.00249  
## 11 Extremadura            0.000700 
## 12 Cantabria              0.000623 
## 13 Navarra                0.000233 
## 14 Melilla                0.0000778
## 15 Principado de Asturias 0.0000778

Apparently the concentration is very high for those regions containing metropolitan areas. The following code summarizes the number of unique regions, provinces and cities in the dataset.

unique_locations <- tibble(
  'no. cities' = length(unique(dataset$municipality)),
  'no. regions' = length(unique(dataset$region)),
  'no. provinces' = length(unique(dataset$province))
)

datatable(unique_locations, options = list(pageLength = 10))

3. Linear Regression

3.1. The Creation of a Regression Subset


In this section a linear regression model is run in order to gain an understanding of the key factors driving real estate prices in Spain. The section starts by describing the variables that will be used as covariates, adding a short explanation of why controlling for each variable is meaningful.

3.1.1. Explanation: Dependent Variable and its Covariates

  • Price (\(\mathsf{price}\)): the dependent variable, the one we would like to predict.
  • Floor (\(\mathsf{floor}\)): the floor on which the apartment or house is located. The floor variable is coded from -3 (sub-basement) to 16 (above the 15th floor); each number represents a floor level. -1 is the so-called ‘entresuelo’ or mezzanine, 0 is the ground floor, and values from 1 to 15 represent the corresponding floor level.
    • Explanation: at least in the case of apartments, higher floors are normally associated with higher prices; the higher the floor, the less noisy the apartment, and the nicer the inhabitants’ views over the city.
  • type: this variable will be broken into its unique values, as they do not have a meaningful ordering.
    • People have different tastes for property types. Houses give the owner the freedom to change their features, to make noise and to have more privacy; penthouses give the owner a private space and nice views over the city, so the type of housing matters. The types of housing in the dataset are the following: apartamento (small apartment), piso (apartment), dúplex, ático (penthouse), casa adosada (row house), casa o chalet (house or villa), planta baja (ground-floor apartment; this one will be recoded to piso, as the information is already contained in the floor variable), finca rústica (ranch), loft and estudio (studio); loft and studio will both be assigned the value loft, as they are practically the same. In summary, we create the following variables out of the type column, each taking the value 1 if true and 0 otherwise:
      • \(\mathsf{sapt}:\text{small apartment (apartamento)}\);
      • \(\mathsf{apt}:\text{apartment (piso or planta baja)}\);
      • \(\mathsf{duplex}:\text{duplex}\);
      • \(\mathsf{penth}:\text{penthouse (ático)}\);
      • \(\mathsf{rhouse}:\text{row-house (casa adosada)}\);
      • \(\mathsf{house}:\text{house or cottage/villa (casa o chalet)}\);
      • \(\mathsf{ranch}:\text{ranch (finca rústica)}\);
  • Energy class (\(\mathsf{energy\_class}\)): The energy class as defined in Section 2.1.
    • Explanation: the lower the class value, the more energy-efficient the house. This means less pollution and lower energy consumption. As the house is more cost-efficient, this feature may be attractive and yield a higher housing price.
  • Population density (\(\mathsf{density\_skm}\)): The population density of the city being observed.
    • Explanation: given that equilibrium prices are an outcome of supply and demand, the density of a city can be understood as a proxy for demand. The hypothesis is that the higher the density, the higher the prices in the city.
  • Salary per capita (\(\mathsf{mean\_salary}\)): The salary per capita in the city where the housing is located.
    • Explanation: the prices suppliers set on their housing must also be affordable. Richer regions have consumers with a higher willingness to pay; hence, the mean salary in the city is likely to play a role in pricing.
  • Condition of the building (\(\mathsf{condition}\)): As the name suggests, this is a discrete variable describing the condition of the building. The conditions in the dataset are Bien (good), Muy Bien (very good), Casi nuevo (almost new), A reformar (needs renovation / poor) and Reformado (renovated). The condition poor will be coded as \(0\), good as \(1\), and very good, renovated and almost new will be grouped together and coded as \(2\).
    • Explanation: it is common sense that the condition of a housing is related to its price. Buyers will only be willing to pay for the years in which the house can still be used. As the coding is ordered, i.e. \(0<1<2\), we expect that the higher the coding, the higher the price of the housing.
  • Parking space (\(\mathsf{parking}\)): Takes the value \(1\) if there is a parking possibility (whether private or not). Here, practically 70% of the dataset has no information at all. It is not unusual in large cities for parking space to be rare. We will code the NA values as \(0\), so this variable is better understood as: the ad answered the parking question vs. the ad had no information on parking.
    • Explanation: as parking space becomes rarer the larger the city, I expect that where a parking space is available, the price of the housing will be higher than where it is not.
  • Lift (\(\mathsf{lift}\)): \(1\) if a housing has a lift and \(0\) otherwise.
    • Explanation: extra features in the housing are likely to have a positive effect on its price.
  • Number of rooms (\(\mathsf{room}\)): The number of rooms in the house/apartment (see the next point for the explanation).
  • Number of bathrooms (\(\mathsf{bathroom}\)): The number of bathrooms in the house/apartment.
    • Explanation: the number of features of the house is likely to be positively correlated with its price.
  • Size (\(\mathsf{size\_sqm}\)): The size of the building in square meters.
    • Explanation: not only the house itself has a price but also the land on which it is built. Note that this size relates to the actual built area, not the whole plot.
  • Unemployment rate (\(\mathsf{unemp}\)): The unemployment rate in the province where the housing is located.
    • Explanation: here again income is hypothesized to play a role in housing prices. If people have a job, even if their incomes are not high, they are able to finance goods.
  • Criminality rate (\(\mathsf{crime\_province}\)): The criminality rate per 100,000 persons in the province where the object is located.
    • Explanation: safety is an important factor for a person, and a region with high criminality may see a decrease in demand. Hence, the hypothesis is that the criminality rate and housing prices are inversely correlated.
  • Age (\(\mathsf{age}\)): The age of the object. In the dataset the age is given in buckets, which will be coded as follows: \[ \mathsf{age}_{i}=\begin{cases} 0 & \text{, if age} < 5\\ 1 & \text{, if } 5 \leq \text{age} < 10\\ 2 & \text{, if } 10 \leq \text{age} < 20\\ 3 & \text{, if } 20 \leq \text{age} < 30\\ 4 & \text{, if } 30 \leq \text{age} < 50\\ 5 & \text{, if } 50 \leq \text{age} < 70\\ 6 & \text{, if } 70 \leq \text{age} < 100\\ 7 & \text{, otherwise} \end{cases} \]
    • Explanation: age is a proxy for how long the house can still be used. My hypothesis is that age and price are non-linearly correlated: houses that are middle-aged or old and in poor condition will be negatively correlated with prices, while houses that are old but in very good condition (also known as historic) are likely to show a high price. We will create an interaction term between age and house condition in order to capture such a relation.
  • Population per city (\(\mathsf{pop}\)): The population of the city where the object is located.
    • Explanation: as with density, the population itself can be used as a predictor of housing prices.
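The age bucketing defined above can be sketched with base R's cut(). This is a hedged illustration only, not the encoding actually used below (which matches the bucket strings scraped from fotocasa):

```r
# Sketch: map an age in years to the ordinal class age_i defined above.
# cut() bins the value; the integer code of the bin, minus 1, is the class.
encode_age <- function(age_years) {
  breaks <- c(-Inf, 5, 10, 20, 30, 50, 70, 100, Inf)
  as.integer(cut(age_years, breaks = breaks, right = FALSE)) - 1L
}

encode_age(c(3, 12, 45, 120))  # 0 2 4 7
```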

3.1.2. Encode of Discrete Variables

In this section the variables from Section 3.1.1 will be encoded. First of all, a subset named subset, containing only those variables that will be used for model estimation, is created. Notice that the column named X is actually an ID column, which will be retained in case the datasets need to be joined again.

colnames(dataset)[1] <- 'ID'
variables <- c('ID', 'price', 'room', 'bathroom', 'size_sqm', 'type', 'floor', 'condition',
               'age', 'parking', 'lift', 'energy_class', 'density_skm', 'mean_salary',
               'unemp', 'crime_province', 'pop')

# create columns to store the different types of housing in the new dataset (subset)
dataset[, variables] %>%
  mutate(sapt = ifelse(type == 'Apartamento', 1, 0),
         apt = ifelse(type == 'Piso' | type == 'Planta baja', 1, 0),
         duplex = ifelse(type == 'Dúplex', 1, 0),
         penth = ifelse(type == 'Ático', 1, 0),
         rhouse = ifelse(type == 'Casa adosada', 1, 0),
         house = ifelse(type == 'Casa o chalet', 1, 0),
         loft = ifelse(type == 'Loft' | type == 'Estudio', 1, 0),
         ranch = ifelse(type == 'Finca rústica', 1, 0)
  ) %>% # Now drop the type column and assign the result to a data frame named subset
    select(-type) %>% as_tibble() -> subset 
# We save this dataset and give it the name 'RawRegressionSet'
#write.csv(subset,'./01_RawDataSets/07_RawRegressionSet.csv', fileEncoding = 'latin1')
RawRegressionSet <- read.csv('./01_RawDataSets/07_RawRegressionSet.csv', fileEncoding = 'latin1')
RawRegressionSet <- select(RawRegressionSet, -'X')

Next, the missing values for the variable parking will be substituted with ‘no’ (recall the explanation in Section 3.1.1 about treating missing values in this one special case, as well as in the lift column, as no rather than non-available); afterwards all rows containing missing values in any of the other columns will be dropped.

subset <- RawRegressionSet
subset[subset$parking == -9, 'parking'] = 'no'

# Now drop all the rows containing the value -9
subset %>% 
  filter(if_all(everything(), ~ . != -9)) -> subset
datatable(subset, options = list(pageLength = 10))

The dataset was now reduced to 3,157 observations. This is the dataset we will work with when doing linear regression. The next step is to encode the variables.

# First of all, create a list with the age-buckets and name them according to their encoding value
age_list <- sort(unique(subset$age))
names(age_list) <- c(7, 0, 2, 3, 4, 1, 5, 6, 0)

# Now we do the same for the floor variable
floor_list <- sort(unique(subset$floor))
names(floor_list) <- c(10, 11, 12, 13, 14, 15, 1, 2, 3, 4, 5, 6, 7, 8, 9, -1, 0, -2, -3, 16)

# We use the lists above and their names to match the values in a further step and pass their names (the encoded values) to the matched rows

subset %>% mutate(
  parking = ifelse(parking == 'no', 0, 1), # encode the parking variable: 0 for no, 1 otherwise
  lift = ifelse(lift == 'no', 0, 1), # encode the lift variable: 0 for no, 1 otherwise
  condition = ifelse(condition == 'A reformar', 0, # encode the condition variable: 0 for poor
                     ifelse(condition == 'Bien', 1, # 1 for good
                            ifelse(condition %in% c('Casi nuevo', 'Muy Bien', 'Reformado'), 2, 0) # 2 for almost new, very good or renovated
                            )
                     ),
  age = as.numeric(names(age_list)[match(age, unlist(age_list))]), # match the values of the age column against the list above and pass on the names (the encoded values) of the matched entries
  floor = as.numeric(names(floor_list)[match(floor, unlist(floor_list))]) # do the same for the floor variable, then assign the result to a dataset called LR_dataset
                  ) -> LR_dataset

# The unemployment rate is divided by 100 
#LR_dataset$unemp <- LR_dataset$unemp/100
#Save the dataset just created as the CleanedRegressionSet
#write.csv(LR_dataset, './01_RawDatasets/08_CleanedRegressionSet.csv', fileEncoding = 'latin1')
LR_dataset <- read.csv('./01_RawDatasets/08_CleanedRegressionSet.csv', fileEncoding = 'latin1')
LR_dataset <- select(LR_dataset, !X)

In the following section, we will do some exploratory analysis based on the subset LR_dataset.


3.2. Exploratory Analysis on the Regression Dataset

First, some descriptive statistics are computed in order to understand the dataset a little better:

summary(LR_dataset)
##        ID            price              room          bathroom    
##  Min.   :    4   Min.   :  35000   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.: 3244   1st Qu.: 178000   1st Qu.: 2.00   1st Qu.: 1.00  
##  Median : 6372   Median : 270000   Median : 3.00   Median : 2.00  
##  Mean   : 6372   Mean   : 397330   Mean   : 3.01   Mean   : 1.71  
##  3rd Qu.: 9542   3rd Qu.: 450000   3rd Qu.: 4.00   3rd Qu.: 2.00  
##  Max.   :12847   Max.   :5000000   Max.   :11.00   Max.   :21.00  
##     size_sqm        floor         condition          age          parking     
##  Min.   :  21   Min.   :-3.00   Min.   :0.000   Min.   :0.00   Min.   :0.000  
##  1st Qu.:  72   1st Qu.: 1.00   1st Qu.:0.000   1st Qu.:4.00   1st Qu.:0.000  
##  Median :  92   Median : 2.00   Median :1.000   Median :5.00   Median :0.000  
##  Mean   : 113   Mean   : 2.96   Mean   :0.846   Mean   :4.35   Mean   :0.266  
##  3rd Qu.: 122   3rd Qu.: 4.00   3rd Qu.:2.000   3rd Qu.:5.00   3rd Qu.:1.000  
##  Max.   :9618   Max.   :16.00   Max.   :2.000   Max.   :7.00   Max.   :1.000  
##       lift        energy_class   density_skm       mean_salary   
##  Min.   :0.000   Min.   :0.00   Min.   :   15.9   Min.   : 9246  
##  1st Qu.:1.000   1st Qu.:2.00   1st Qu.: 4816.7   1st Qu.:12490  
##  Median :1.000   Median :5.00   Median : 5688.7   Median :16750  
##  Mean   :0.764   Mean   :3.79   Mean   : 9479.9   Mean   :14797  
##  3rd Qu.:1.000   3rd Qu.:5.00   3rd Qu.:16508.9   3rd Qu.:16750  
##  Max.   :1.000   Max.   :6.00   Max.   :21406.8   Max.   :17059  
##      unemp        crime_province      pop               sapt       
##  Min.   :0.0635   Min.   :2687   Min.   :   5910   Min.   :0.0000  
##  1st Qu.:0.1015   1st Qu.:5340   1st Qu.: 319515   1st Qu.:0.0000  
##  Median :0.1101   Median :5904   Median :1636193   Median :0.0000  
##  Mean   :0.1158   Mean   :5628   Mean   :1426464   Mean   :0.0187  
##  3rd Qu.:0.1171   3rd Qu.:6412   3rd Qu.:1636193   3rd Qu.:0.0000  
##  Max.   :0.2606   Max.   :6412   Max.   :3280782   Max.   :1.0000  
##       apt            duplex           penth            rhouse       
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :1.000   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.873   Mean   :0.0253   Mean   :0.0576   Mean   :0.00348  
##  3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##      house             loft            ranch        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.0165   Mean   :0.0038   Mean   :0.00158  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000

For the variables related to housing type, the mean values are complementary, i.e. they sum to 1. 87.3% of the dataset consists of apartments; the second most frequent housing type is the penthouse, itself a kind of apartment. Houses together represent \((rhouse + house) = 0.00348 + 0.0165 \approx 0.02 \approx 2\%\) of the dataset. So, based on the mean values, the average housing in the dataset is an apartment with 3 rooms, 2 bathrooms and a size of 113 sqm; it is located on the third floor, has a moderate to bad energy-efficiency classification (3.79), is around 50 years old (class 4.35), has a lift but probably no parking, and is located in a large to very large city (median population 1,636,193).
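This complementarity can be verified with a quick sanity check. The sketch below is hedged: it uses a toy data frame, but the same function applied to the regression dataset with the eight type columns should return TRUE.

```r
# Check that one-hot type dummies are mutually exclusive and exhaustive:
# every row's dummies must sum to exactly 1.
check_one_hot <- function(df, dummy_cols) {
  all(rowSums(df[, dummy_cols, drop = FALSE]) == 1)
}

# toy example with two observations
toy <- data.frame(apt = c(1, 0), penth = c(0, 1), ranch = c(0, 0))
check_one_hot(toy, c('apt', 'penth', 'ranch'))  # TRUE
```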

Due to the great difference between the mean and the median of the variables \(\mathsf{price}\) and \(\mathsf{density\_skm}\), an outlier problem is very likely.

to_plot  <- gather(LR_dataset, variable, value)
to_exclude <- c('ID', 'age', 'parking', 'lift', 'mun_class', 'sapt', 'apt', 'duplex', 'penth', 'house', 'rhouse', 'loft', 'ranch', 'condition', 'floor', 'energy_class')

cont_to_plot  <- filter(to_plot, !variable %in% to_exclude)
level_plot <- ggplot(cont_to_plot, aes(x = variable, y = value)) +
  geom_boxplot() +
  xlab("Variable") +
  ylab("Value") +
  theme_minimal() +
  ggtitle("Boxplot: Continuous Covariates Full Dataset") +
  facet_wrap(~variable, scales = "free")

level_plot

Next, the data is transformed to log values to check whether this benefits the analysis.


log_variables <- c('bathroom', 'crime_province', 'price', 'room', 'size_sqm', 'unemp', 'pop', 'density_skm')
LR_dataset %>% # apply log to the variables in the log_variables vector and save them to another dataset named LLR_dataset
  mutate_at(vars(all_of(log_variables)), log) -> LLR_dataset

The data is plotted again using this new dataset.


to_plot  <- gather(LLR_dataset, variable, value)
to_exclude <- c('ID', 'age', 'parking', 'lift', 'mun_class', 'sapt', 'apt', 'duplex', 'penth', 'house', 'rhouse', 'loft', 'ranch', 'condition', 'floor', 'room', 'bathroom')

to_plot <- filter(to_plot, !variable %in% to_exclude)
ggplot(to_plot, aes(x = variable, y = value)) +
  geom_boxplot() +
  xlab("Variable") +
  ylab("Value") +
  theme_minimal() +
  ggtitle("Boxplot: Continuous Covariates Log Values") +
  facet_wrap(~variable, scales = "free")

The approach did not really help. The outlier problem persists even after logging the values. 10% of the dataset will therefore be trimmed (the bottom and top 5%, i.e. the 0.05 and 0.95 quantiles), which is likely to solve the problem. The focus will lie mainly on the distribution of price:


boundaries <- quantile(LR_dataset$price, probs = c(0.05, 0.95))
cutLR_dataset <- LR_dataset[LR_dataset$price >= boundaries[[1]] & LR_dataset$price <= boundaries[[2]],]
Lboundaries <- quantile(LLR_dataset$price, probs = c(0.05, 0.95))
cutLLR_dataset <- LLR_dataset[LLR_dataset$price >= Lboundaries[[1]] & LLR_dataset$price <= Lboundaries[[2]],]

  • First, the plot of the level dataset:


to_plot  <- gather(cutLR_dataset, variable, value)
to_exclude <- c('ID', 'age', 'parking', 'lift', 'mun_class', 'sapt', 'apt', 'duplex', 'penth', 'house', 'rhouse', 'loft', 'ranch', 'condition', 'floor', 'room', 'bathroom')

to_plot <- filter(to_plot, !variable %in% to_exclude)
ggplot(to_plot, aes(x = variable, y = value)) +
  geom_boxplot() +
  xlab("Variable") +
  ylab("Value") +
  theme_minimal() +
  ggtitle("Boxplot: Continuous Covariates 90%-Dataset") +
  facet_wrap(~variable, scales = "free")

Apparently trimming 10% of the dataset did not bring substantial improvement to the outlier issue in the level case. While the outliers at the bottom of the price distribution were removed, this was not the case for the upper tail.

  • Next, the trimmed log dataset is plotted:


to_plot  <- gather(cutLLR_dataset, variable, value)
to_exclude <- c('ID', 'age', 'parking', 'lift', 'mun_class', 'sapt', 'apt', 'duplex', 'penth', 'house', 'rhouse', 'loft', 'ranch', 'condition', 'floor', 'room', 'bathroom')

to_plot_cutted <- filter(to_plot, !variable %in% to_exclude)
logplot <- ggplot(to_plot_cutted, aes(x = variable, y = value)) +
  geom_boxplot() +
  xlab("Variable") +
  ylab("Value") +
  theme_minimal() +
  ggtitle("Boxplot: Continuous Variables 90%-Dataset Log") +
  facet_wrap(~variable, scales = "free")


#ggsave('./00_Pictures/02_Processed/09_BoxPContCov.png')
#ggsave('./00_Pictures/02_Processed/09_BoxPContCov.png', plot = logplot)

logplot

As can be seen from the plot above, at least for the dependent variable (price) the outlier problem could be solved. Hence, the linear regression will be carried out using the log values of those variables.
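The effect of the log transform can be illustrated on synthetic data. The sketch below is hedged: the log-normal draw merely mimics the right skew of housing prices and is not the actual dataset.

```r
# Sample skewness computed from scratch, no extra packages needed.
skewness <- function(x) mean(((x - mean(x)) / sd(x))^3)

set.seed(23111992)
prices <- rlnorm(1000, meanlog = 12.5, sdlog = 0.6)  # synthetic right-skewed prices
skewness(prices)       # clearly positive: long right tail
skewness(log(prices))  # near zero: roughly symmetric
```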

Again, the average housing of this dataset is studied (in levels):

# min, max, 1st Qu, Median, Mean, 3rd Qu, Max
labels = c('Min', 'Max', '1st Qu', 'Median', 'Mean', '3rd Qu')

variables <- colnames(select(cutLR_dataset, !ID))

statistics <- data.frame(
  'Variable' = variables
)

for (label in labels) {
  statistics = mutate(statistics, !!label := rep(0, 23))
}

for (row in 1:nrow(statistics)) {
  statistics[row, 'Min'] = round(min(cutLR_dataset[statistics[row, 'Variable']]), 2)
  statistics[row, 'Max'] = round(max(cutLR_dataset[statistics[row, 'Variable']]), 2)
  statistics[row, '1st Qu'] = round(quantile(unlist(cutLR_dataset[statistics[row, 'Variable']]), 0.25), 2)
  statistics[row, 'Median'] = round(quantile(unlist(cutLR_dataset[statistics[row, 'Variable']]), 0.5), 2)
  statistics[row, 'Mean'] = round(mean(unlist(cutLR_dataset[statistics[row, 'Variable']])), 3)
  statistics[row, '3rd Qu'] = round(quantile(unlist(cutLR_dataset[statistics[row, 'Variable']]), 0.75), 2)
}
datatable(statistics, options = list(pageLength = 10))

The descriptive statistics above will be exported to LaTeX format.

#write_latex_table <- kable(statistics, format = "latex", booktabs = TRUE)
#writeLines(write_latex_table, 'table.tex')

So, based on the average values, the average object in the dataset costs €343,483, is an apartment with 3 rooms, 2 bathrooms and a size of 107 square meters, is located on the third floor, is in somewhat good condition, is around 50 years old, has no parking space, has a lift and a moderate to bad energy classification; it is situated in a very large city with 1,424,755 inhabitants, in a province where the criminality rate per 100,000 pop. equals 5,658 and the unemployment rate is 11.52%; the mean salary of the city is €14,826 and its density 9,767.6 inhab./km².

LLR_dataset <- cutLLR_dataset

new_names <- c('lprice', 'lroom', 'lbathroom', 'lsize_sqm', 'ldensity_skm', 'lmean_salary',
               'lunemp', 'lcrime_province', 'lpop')
colnames(LLR_dataset)[c(2, 3, 4, 5, 12, 13, 14, 15, 16)] <- new_names

write.csv(LLR_dataset, '09_CleanedLogRegressionSet.csv', fileEncoding = 'latin1')
LLR_dataset <- read.csv('09_CleanedLogRegressionSet.csv', encoding = 'latin1', row.names = NULL)
LLR_dataset <- select(LLR_dataset, -c('X'))

A closer look at the distribution of prices should confirm that the outlier problem has been removed:

options(scipen = 2, digits = 6)
dens <- density(LLR_dataset$lprice)
plot(dens, col = 'darkblue', ylab = 'Density', xlab = 'Log price', main = 'LogPrice-Distribution')
polygon(dens, col = 'lightblue', border = NA)
grid()

Although the distribution does not really look normal, as stated, there are no outliers anymore. A correlation-matrix plot of the continuous variables may give some information about how the covariates relate to the price.

cor_notvar <- c('ranch', 'loft', 'rhouse', 'house', 'sapt', 'duplex', 'energy_class', 'lift', 'parking', 'age', 'condition', 'floor', 'apt', 'penth', 'ID')


cor_matrix <- cor(select(LR_dataset, !all_of(cor_notvar)))
corrplot(cor_matrix, method = 'color', type = 'full', order = 'hclust', tl.col = 'black')

3.3. Model Estimation


The model reads:

\[\boldsymbol{\mathsf{lprice}}_{\left(2847\times1\right)}=\mathbf{X}_{\left(2847\times24\right)}\boldsymbol{\beta}_{\left(24\times1\right)}+\boldsymbol{\varepsilon}_{\left(2847\times1\right)},\,\boldsymbol{\varepsilon}\overset{i.i.d.}{\sim}\mathcal{N}\left(0,\sigma^{2}\right)\] where \(\boldsymbol{X}\) is a matrix with the covariates and a vector of \(1\)’s in its first column, \(\boldsymbol{\beta}\) is a one-column vector of coefficients, with \(\beta_{0}\) as its first entry and \(\beta_{23}\) as its last, and \(\boldsymbol{\varepsilon}\) is a normally distributed vector with mean zero and homoscedastic variance (which we will test for).

| \(i\) | \(x_{i}\) | \(Label_{i}\) |
|---|---|---|
| 0 | 1 | constant |
| 1 | lroom | log no. of rooms |
| 2 | lbathroom | log no. of bathrooms |
| 3 | lsize_sqm | log built size of property |
| 4 | floor | floor where object is located |
| 5 | condition | condition of the object [poor (0), good (1), very good (2)] |
| 6 | age | age class of the object [0: <5 years to 7: >100 years] |
| 7 | parking | parking space [1 if true] |
| 8 | lift | has lift [1 if true] |
| 9 | energy_class | energy class [0: best to 6: worst] |
| 10 | ldensity_skm | log pop. density per \(km^{2}\) |
| 11 | lmean_salary | log city mean salary |
| 12 | lunemp | log province unemployment rate |
| 13 | lcrime_province | log province criminality rate per 100,000 pop. |
| 14 | lpop | log city population |
| 15 | sapt | 1 if small apartment |
| 16 | apt | 1 if apartment |
| 17 | duplex | 1 if duplex |
| 18 | penth | 1 if penthouse |
| 19 | rhouse | 1 if row house |
| 20 | house | 1 if house |
| 21 | loft | 1 if loft |
| 22 | lroom * lsize_sqm | interaction between lroom and lsize_sqm |
| 23 | age * condition | interaction between age and condition classes |

Notice that the variable \(\mathsf{ranch}\) is not included in the equation. This variable is implied and is known as the base category: the case in which all factor/dummy variables are set to zero. Here the base category is a ranch that is very new (age = 0), in poor condition (condition = 0), has no lift and no parking space, but has the best energy-efficiency classification (energy_class = 0).
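The role of the base category can be illustrated with a minimal toy sketch (hypothetical data, not the fotocasa sample): with property-type dummies, the intercept estimates the mean log price of the omitted category.

```r
# Toy sketch (hypothetical data): 'apt' and 'house' dummies, with
# apt = house = 0 meaning the base category ('ranch' in our model).
toy <- data.frame(
  lprice = c(11.0, 11.5, 12.0, 12.2),
  apt    = c(1, 0, 0, 1),
  house  = c(0, 1, 0, 0)
)
fit_toy <- lm(lprice ~ apt + house, data = toy)
# The intercept equals the mean log price of the base-category rows (12 here);
# the dummy coefficients measure deviations from that base.
coef(fit_toy)[['(Intercept)']]
```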

Run the regression:


model.equation <- 'lprice ~ lroom + lbathroom + lsize_sqm + floor + parking + lift + energy_class + ldensity_skm + lmean_salary + lunemp + lcrime_province + sapt + duplex + penth + rhouse + apt + house + loft + lpop + lroom * lsize_sqm + age * condition '
model <- lm(model.equation, data = LLR_dataset)

table_results <- stargazer(model,
          type = 'text',   # either text or latex output
          align = TRUE,
          header = FALSE,
          summary = TRUE,
          title = 'Regression-Results',
          style = 'aer'    # adjust the style here, e.g., commadefault
          )
## 
## Regression-Results
## ==========================================================
##                                     lprice                
## ----------------------------------------------------------
## lroom                             -0.834***               
##                                    (0.143)                
##                                                           
## lbathroom                          0.421***               
##                                    (0.022)                
##                                                           
## lsize_sqm                          0.507***               
##                                    (0.040)                
##                                                           
## floor                              0.011***               
##                                    (0.003)                
##                                                           
## parking                            0.076***               
##                                    (0.017)                
##                                                           
## lift                               0.210***               
##                                    (0.017)                
##                                                           
## energy_class                      -0.013***               
##                                    (0.004)                
##                                                           
## ldensity_skm                       0.057***               
##                                    (0.011)                
##                                                           
## lmean_salary                       0.000***               
##                                    (0.000)                
##                                                           
## lunemp                              0.006                 
##                                    (0.049)                
##                                                           
## lcrime_province                    0.425***               
##                                    (0.078)                
##                                                           
## sapt                                0.136                 
##                                    (0.159)                
##                                                           
## duplex                              -0.016                
##                                    (0.157)                
##                                                           
## penth                               0.144                 
##                                    (0.154)                
##                                                           
## rhouse                              0.014                 
##                                    (0.187)                
##                                                           
## apt                                 0.003                 
##                                    (0.152)                
##                                                           
## house                               -0.035                
##                                    (0.158)                
##                                                           
## loft                                -0.200                
##                                    (0.188)                
##                                                           
## lpop                               0.081***               
##                                    (0.017)                
##                                                           
## age                                0.034***               
##                                    (0.007)                
##                                                           
## condition                          0.060***               
##                                    (0.021)                
##                                                           
## lroom:lsize_sqm                    0.148***               
##                                    (0.032)                
##                                                           
## age:condition                       -0.004                
##                                    (0.004)                
##                                                           
## Constant                           3.892***               
##                                    (0.622)                
##                                                           
## Observations                        2,847                 
## R2                                  0.671                 
## Adjusted R2                         0.668                 
## Residual Std. Error           0.332 (df = 2823)           
## F Statistic               249.800*** (df = 23; 2823)      
## ----------------------------------------------------------
## Notes:              ***Significant at the 1 percent level.
##                     **Significant at the 5 percent level. 
##                     *Significant at the 10 percent level.
#latex_table <- print(table_results, tabular.environment = 'longtable')
#writeLines(latex_table, 'table.tex')
summary_table <- xtable(summary(model))
attr(summary_table, 'caption') <- 'Linear Regression - Results'

colnames(summary_table) <- c('Coeff.', 'Std. Error', 't-value', 'Pr(>|t|)')

#print(summary_table, tabular.environment = 'longtable')

datatable(summary_table)

Next, the model is used to make a point estimate at the mean values of the covariates; for OLS with an intercept, this equals the mean of the fitted values.

prediction <- predict(model)
mean(prediction)
## [1] 12.5728
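As a sanity check on this interpretation: for any OLS fit with an intercept, the mean of the fitted values equals the mean of the response and also equals the prediction at the covariate means. A quick sketch on the built-in mtcars data (not the housing sample):

```r
# Built-in mtcars data as a stand-in; the identity holds for any OLS fit
# with an intercept.
fit_ols <- lm(mpg ~ wt + hp, data = mtcars)
mean_fit <- mean(fitted(fit_ols))
at_means <- predict(fit_ols,
                    newdata = data.frame(wt = mean(mtcars$wt),
                                         hp = mean(mtcars$hp)))
all.equal(mean_fit, mean(mtcars$mpg))   # TRUE: residuals sum to zero
all.equal(unname(at_means), mean_fit)   # TRUE: the fit is linear in X
```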

3.4. Diagnostics

In this section, the Gauss-Markov assumptions will be tested.

Define a function to interpret the tests:


print_interpretation <- function(x){
  if (x$p.value < 0.05) {
    print(paste('reject null hypothesis. pvalue:', x$p.value, sep = " "))
  } else {
    print(paste('failed to reject null hypothesis. pvalue:', x$p.value, sep = " "))
  }
}

  1. Breusch-Pagan test for heteroscedasticity:

\[H_{0}:\text{error term is homoscedastic}\]

test_homos <- bptest(model, ~ fitted(model) + I(fitted(model)^2))


print_interpretation(test_homos)
## [1] "reject null hypothesis. pvalue: 5.05351855997932e-83"
plot(model$residuals, 
     xlab = 'unit i', 
     ylab = 'residuals',
     main = 'Residual Plot')
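Since homoscedasticity is rejected, the usual OLS standard errors are unreliable; heteroscedasticity-consistent (HC) standard errors are a common remedy. A sketch assuming the sandwich and lmtest packages are available, illustrated on the built-in mtcars data rather than on the housing model itself:

```r
library(lmtest)     # coeftest()
library(sandwich)   # vcovHC()

# HC1-robust standard errors for an example OLS fit; the analogous call for
# the housing model would be coeftest(model, vcov = vcovHC(model, type = 'HC1')).
fit_ex <- lm(mpg ~ wt + hp, data = mtcars)
coeftest(fit_ex, vcov = vcovHC(fit_ex, type = 'HC1'))
```

The coefficient estimates are unchanged; only the standard errors (and hence the t-statistics) are adjusted.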

  2. Test for normality (Jarque-Bera):

\[H_{0}:\text{error term is normally distributed}\]

test_jb <- jarque.bera.test(model$residuals)

print_interpretation(test_jb)
## [1] "reject null hypothesis. pvalue: 0"
hist(model$residuals, breaks = 100,
     main = 'Residual Distribution',
     xlab = 'observed log price - fitted log price')
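For reference, the Jarque-Bera statistic combines the sample skewness \(S\) and kurtosis \(K\) of the residuals and is asymptotically \(\chi^{2}\)-distributed with two degrees of freedom:

\[JB=\frac{n}{6}\left(S^{2}+\frac{\left(K-3\right)^{2}}{4}\right)\overset{a}{\sim}\chi_{2}^{2}\]

Large skewness or excess kurtosis inflates \(JB\), which is why the heavy-tailed residual distribution above is rejected so decisively.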

  3. Ramsey RESET test for linearity of the data (functional form):

\[H_{0}:\text{The functional form is correct}\]

test_reset <- reset(model, 2:3)
print_interpretation(test_reset)
## [1] "reject null hypothesis. pvalue: 1.56956962172245e-60"
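What reset(model, 2:3) does under the hood can be sketched by hand: augment the regression with the second and third powers of the fitted values and F-test their joint significance (shown on the built-in mtcars data, not the housing sample):

```r
# Manual RESET: the F-test from anova() on the augmented model reproduces
# reset(fit_ex, power = 2:3, type = 'fitted').
fit_ex  <- lm(mpg ~ wt + hp, data = mtcars)
fit_aug <- lm(mpg ~ wt + hp + I(fitted(fit_ex)^2) + I(fitted(fit_ex)^3),
              data = mtcars)
anova(fit_ex, fit_aug)   # F-test of the added powers
```

A significant F-statistic indicates that powers of the fitted values add explanatory power, i.e. the linear functional form is misspecified.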
  4. Durbin-Watson test for autocorrelation:

\[H_{0}:\text{No correlation in the residuals}\]

test_dw <- dwtest(formula = model, alternative = 'two.sided')
print_interpretation(test_dw)
## [1] "failed to reject null hypothesis. pvalue: 0.599910543036602"
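As a rule of thumb, the Durbin-Watson statistic relates to the first-order autocorrelation \(\hat{\rho}_{1}\) of the residuals via

\[DW\approx2\left(1-\hat{\rho}_{1}\right),\]

so values near \(2\) indicate no first-order autocorrelation, consistent with the failure to reject \(H_{0}\) here.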


  1. For those who are not familiar with the structure of a python dictionary, it resembles a dataframe in R with row names, where the key can be interpreted as the row name and the value as the entry of the table. The general format reads {key: value}.↩︎

  2. Notice that the city name was retrieved from the URL. By looking at the URL displayed in the output of Listing 1.2, the reader will find the city name in ’es/es/comprar/vivienda/sevilla-capital/….↩︎